
Foundations of Statistical Inference
Julien Berestycki
Department of Statistics, University of Oxford
MT 2016

Lecture 6: Bayesian Inference

Ideas of probability

The majority of the statistics you have learned so far are called classical or Frequentist. The probability of an event is defined as the proportion of successes in an infinite number of repeatable trials.

By contrast, in Subjective Bayesian inference, probability is a measure of the strength of belief.

We treat parameters as random variables. Before collecting any data we assume that there is uncertainty about the value of a parameter. This uncertainty can be formalised by specifying a pdf (or pmf) for the parameter. We then conduct an experiment to collect some data that will give us information about the parameter, and use Bayes' Theorem to combine our prior beliefs with the data to derive an updated estimate of our uncertainty about the parameter.

The history of Bayesian Statistics

Bayesian methods originated with Bayes and Laplace (late 1700s to mid 1800s).

In the early 1920s, Fisher put forward an opposing viewpoint: that statistical inference must be based entirely on probabilities with a direct experimental interpretation, i.e. the repeated sampling principle.

In 1939 Jeffreys' book Theory of Probability started a resurgence of interest in Bayesian inference.

This continued throughout the 1950s and 60s, especially as problems with the Frequentist approach started to emerge.

The development of simulation-based inference has transformed Bayesian statistics in the last 20-30 years, and it now plays a prominent part in modern statistics.


Problems with Frequentist Inference - I

Frequentist inference generally does not condition on the observed data.

A confidence interval is a set-valued function C(X) ⊆ Θ of the data X which covers the parameter, θ ∈ C(X), for a fraction 1 − α of repeated draws of X taken under the null H_0.

This is not the same as the statement that, given data X = x, the interval C(x) covers θ with probability 1 − α. But this is the type of statement we might wish to make. (Observe that this statement makes sense iff θ is a r.v.)

Example 1
Suppose X_1, X_2 ~ U(θ − 1/2, θ + 1/2), and let X_(1) ≤ X_(2) be the order statistics. Then C(X) = [X_(1), X_(2)] is an α = 1/2 level CI for θ. Suppose that in your data X = x you find x_(2) − x_(1) > 1/2 (this happens in a quarter of data sets). Then θ ∈ [x_(1), x_(2)] with probability one.
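
A minimal simulation sketch (not from the lecture) of Example 1: it checks that [X_(1), X_(2)] covers θ in about half of repeated samples, and that whenever the observed range exceeds 1/2 the interval is certain to contain θ. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
x = rng.uniform(theta - 0.5, theta + 0.5, size=(100_000, 2))
lo, hi = x.min(axis=1), x.max(axis=1)

covers = (lo <= theta) & (theta <= hi)   # did the interval cover theta?
wide = (hi - lo) > 0.5                   # is the observed range larger than 1/2?

print(covers.mean())        # about 0.5: the advertised coverage
print(wide.mean())          # about 0.25: how often the range exceeds 1/2
print(covers[wide].mean())  # exactly 1.0: given a wide interval, coverage is certain
```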

Problems with Frequentist Inference - II

Frequentist inference depends on data that were never observed.

The likelihood principle
Suppose that two experiments relating to θ, E_1 and E_2, give rise to data y_1 and y_2, such that the corresponding likelihoods are proportional, that is, for all θ,

L(θ; y_1, E_1) = c L(θ; y_2, E_2).

Then the two experiments lead to identical conclusions about θ.

Key point
MLEs respect the likelihood principle, i.e. the MLEs for θ are identical in both experiments. But significance tests do not respect the likelihood principle.

Consider a pure test of significance where we specify just H_0. We must choose a test statistic T(x), and define the p-value for data T(x) = t as

p-value = P(T(X) at least as extreme as t | H_0).

The choice of T(X) amounts to a statement about the direction of likely departures from the null, which requires some consideration of alternative models.

Note 1
The calculation of the p-value involves a sum (or integral) over data that were not observed, and this can depend upon the form of the experiment.

Note 2
A p-value is not P(H_0 | T(X) = t).

Example
A Bernoulli trial succeeds with probability p.

E_1: fix n_1 Bernoulli trials, count the number y_1 of successes.
E_2: count the number n_2 of Bernoulli trials needed to get a fixed number y_2 of successes.

The likelihoods are

L(p; y_1, E_1) = C(n_1, y_1) p^{y_1} (1 − p)^{n_1 − y_1}   (binomial),
L(p; n_2, E_2) = C(n_2 − 1, y_2 − 1) p^{y_2} (1 − p)^{n_2 − y_2}   (negative binomial),

where C(n, k) denotes the binomial coefficient. If n_1 = n_2 = n and y_1 = y_2 = y then L(p; y_1, E_1) ∝ L(p; n_2, E_2), so the MLEs for p are the same under E_1 and E_2.

Example
But significance tests contradict this: e.g., test H_0: p = 1/2 against H_1: p < 1/2, and suppose n = 12 and y = 3.

The p-value based on E_1 is

P(Y ≤ y | θ = 1/2) = Σ_{k=0}^{y} C(n, k) (1/2)^k (1 − 1/2)^{n−k} = 0.073,

while the p-value based on E_2 is

P(N ≥ n | θ = 1/2) = Σ_{k=n}^{∞} C(k − 1, y − 1) (1/2)^y (1 − 1/2)^{k−y} = 0.033,

so we reach different conclusions at significance level 0.05.

Note
The p-values disagree because they sum over portions of two different sample spaces.
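
A minimal sketch (not from the lecture) reproducing the two p-values numerically; the only assumption is the standard scipy parameterisation, in which nbinom counts failures before the y-th success.

```python
from scipy import stats

n, y, p0 = 12, 3, 0.5

# E1 (binomial sampling, n fixed): P(Y <= y | p = 1/2)
p_e1 = stats.binom.cdf(y, n, p0)

# E2 (negative binomial sampling, y fixed): P(N >= n | p = 1/2).
# N >= n trials corresponds to at least n - y failures before the y-th success.
p_e2 = stats.nbinom.sf(n - y - 1, y, p0)

print(round(p_e1, 3), round(p_e2, 3))   # 0.073 and 0.033
```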


Bayesian Inference - Revision

Likelihood f(x | θ) and prior distribution π(θ) for ϑ. The posterior distribution of ϑ at ϑ = θ, given x, is

π(θ | x) = f(x | θ) π(θ) / ∫ f(x | θ) π(θ) dθ,   i.e.   π(θ | x) ∝ f(x | θ) π(θ),

posterior ∝ likelihood × prior.

The same form holds for θ continuous (π(θ | x) a pdf) or discrete (π(θ | x) a pmf). We call ∫ f(x | θ) π(θ) dθ the marginal likelihood.

Likelihood principle
Notice that, if we base all inference on the posterior distribution, then we respect the likelihood principle. If two likelihood functions are proportional, then the constant cancels top and bottom in Bayes' rule, and the two posterior distributions are the same.
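
The update can be carried out numerically even without conjugacy. Below is a minimal grid-approximation sketch (not from the lecture); the prior, data and grid are illustrative choices only.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)      # grid of trial parameter values
d = theta[1] - theta[0]

prior = stats.beta.pdf(theta, 2, 2)          # illustrative prior pi(theta)
likelihood = stats.binom.pmf(7, 10, theta)   # f(x | theta) for x = 7 successes in n = 10 trials

unnorm = likelihood * prior                  # likelihood x prior
posterior = unnorm / (unnorm.sum() * d)      # divide by the marginal likelihood

print((theta * posterior).sum() * d)         # posterior mean, about (2 + 7)/(2 + 2 + 10) = 0.643
```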

Example 1
X ~ Bin(n, ϑ) for known n and unknown ϑ. Suppose our prior knowledge about ϑ is represented by a Beta distribution on (0, 1), and θ is a trial value for ϑ:

π(θ) = θ^{a−1} (1 − θ)^{b−1} / B(a, b),   0 < θ < 1.

[Figure: Beta prior densities for (a, b) = (1, 1), (20, 20), (1, 3) and (8, 2).]

Example 1
Prior probability density:

π(θ) = θ^{a−1} (1 − θ)^{b−1} / B(a, b),   0 < θ < 1.

Likelihood:

f(x | θ) = C(n, x) θ^x (1 − θ)^{n−x},   x = 0, ..., n.

Posterior probability density:

π(θ | x) ∝ likelihood × prior
         ∝ θ^{a−1} (1 − θ)^{b−1} θ^x (1 − θ)^{n−x}
         = θ^{a+x−1} (1 − θ)^{n−x+b−1}.

Here the posterior has the same form as the prior (conjugacy), with the parameters a, b replaced by a + x, b + n − x, so

π(θ | x) = θ^{a+x−1} (1 − θ)^{b+n−x−1} / B(a + x, b + n − x).
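
A minimal sketch (not from the lecture) of this conjugate update using scipy; the hyperparameters and data are illustrative.

```python
from scipy import stats

a, b = 1.0, 1.0                              # prior hyperparameters (uniform prior)
n, x = 10, 7                                 # n trials, x successes

posterior = stats.beta(a + x, b + n - x)     # Beta(a + x, b + n - x)

print(posterior.mean())                      # (a + x)/(a + b + n)
print(posterior.interval(0.95))              # central 95% posterior interval
```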

Example 1
For a Beta distribution with parameters a, b,

µ = a / (a + b),   σ² = ab / [(a + b)² (a + b + 1)].

The posterior mean and variance are therefore

(a + X) / (a + b + n)   and   (a + X)(b + n − X) / [(a + b + n)² (a + b + n + 1)].

Suppose X = n and we set a = b = 1 for our prior. Then the posterior mean is (n + 1)/(n + 2), i.e. when we observe events of just one type our point estimate is not 0 or 1 (which is sensible, especially for small sample sizes).

Example 1
For large n, the posterior mean and variance are approximately X/n and X(n − X)/n³. In classical statistics the corresponding quantities are

θ̂ = X/n,   θ̂(1 − θ̂)/n = X(n − X)/n³.
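
A quick numeric sketch (not from the lecture) comparing the exact posterior mean and variance with these large-n approximations; the sample size and counts are illustrative.

```python
a, b = 1.0, 1.0
n, x = 1000, 300                 # large illustrative sample

post_mean = (a + x) / (a + b + n)
post_var = (a + x) * (b + n - x) / ((a + b + n) ** 2 * (a + b + n + 1))

print(post_mean, x / n)                    # 0.3004... vs 0.3
print(post_var, x * (n - x) / n ** 3)      # both about 2.1e-4
```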

Example 1
Suppose ϑ = 0.3 but the prior mean is 0.7 with standard deviation 0.1. Suppose the data are X ~ Bin(n, ϑ) with n = 10 (yielding X = 3) and then n = 100 (yielding X = 30, say).

[Figure: prior density together with the posterior densities for n = 10 and n = 100, plotted against θ.]

As n increases, the likelihood overwhelms the information in the prior.
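
A minimal sketch (not the lecture's code) reproducing this figure: a Beta prior with mean 0.7 and standard deviation 0.1 corresponds to Beta(14, 6), and the posteriors follow from the conjugate update with x = 3 of n = 10 and x = 30 of n = 100.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

a, b = 14.0, 6.0                                   # prior mean 0.7, prior sd 0.1
theta = np.linspace(0, 1, 500)

plt.plot(theta, stats.beta.pdf(theta, a, b), label="prior")
plt.plot(theta, stats.beta.pdf(theta, a + 3, b + 10 - 3), label="post n=10")
plt.plot(theta, stats.beta.pdf(theta, a + 30, b + 100 - 30), label="post n=100")
plt.xlabel("theta")
plt.ylabel("probability density")
plt.legend()
plt.show()
```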

Example 2 - Conjugate priors
Normal distribution when the mean and variance are unknown. Let τ = 1/σ² and θ = (τ, µ); τ is called the precision.

The prior: τ has a Gamma distribution with parameters α, β > 0 and, conditional on τ, µ has a N(ν, 1/(kτ)) distribution for some k > 0, ν ∈ ℝ. The prior density is

π(τ, µ) = [β^α / Γ(α)] τ^{α−1} e^{−βτ} · (2π)^{−1/2} (kτ)^{1/2} exp{ −(kτ/2) (µ − ν)² },

or

π(τ, µ) ∝ τ^{α−1/2} exp[ −τ { β + (k/2)(µ − ν)² } ].

The likelihood is

f(x | µ, τ) = (2π)^{−n/2} τ^{n/2} exp{ −(τ/2) Σ_{i=1}^{n} (x_i − µ)² }.

Example 2 - Conjugate priors
Thus

π(τ, µ | x) ∝ τ^{α+(n/2)−1/2} exp[ −τ { β + (k/2)(µ − ν)² + (1/2) Σ_{i=1}^{n} (x_i − µ)² } ].

Complete the square to see that

k(µ − ν)² + Σ (x_i − µ)² = (k + n) ( µ − (kν + n x̄)/(k + n) )² + [nk/(n + k)] (x̄ − ν)² + Σ (x_i − x̄)².

Example 2 - Completing the square
The posterior dependence on µ is entirely in the factor

exp[ −τ { β + (k/2)(µ − ν)² + (1/2) Σ_{i=1}^{n} (x_i − µ)² } ].

The idea is to write this as

c exp[ −τ { β' + (k'/2)(ν' − µ)² } ]

for suitable constants, so that (conditional on τ) we recognise a Normal density:

k(µ − ν)² + Σ (x_i − µ)² = µ²(k + n) − µ(2kν + 2Σ x_i) + …
                         = (k + n) ( µ − (kν + n x̄)/(k + n) )² + [nk/(n + k)] (x̄ − ν)² + Σ (x_i − x̄)².

Example 2 - Conjugate priors
Thus the posterior is

π(τ, µ | x) ∝ τ^{α'−1/2} exp[ −τ { β' + (k'/2)(ν' − µ)² } ],

where

α' = α + n/2,
β' = β + (1/2) [nk/(n + k)] (x̄ − ν)² + (1/2) Σ (x_i − x̄)²,
k' = k + n,
ν' = (kν + n x̄)/(k + n).

This is the same form as the prior, so this class of priors is conjugate.
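
A minimal sketch (not from the lecture) implementing this Normal-Gamma update; the data and hyperparameters in the example call are illustrative only.

```python
import numpy as np

def normal_gamma_update(x, alpha, beta, k, nu):
    """Posterior hyperparameters for tau ~ Gamma(alpha, rate beta), mu | tau ~ N(nu, 1/(k tau))."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    alpha_post = alpha + n / 2
    beta_post = beta + 0.5 * n * k / (n + k) * (xbar - nu) ** 2 + 0.5 * np.sum((x - xbar) ** 2)
    k_post = k + n
    nu_post = (k * nu + n * xbar) / (k + n)
    return alpha_post, beta_post, k_post, nu_post

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=50)
print(normal_gamma_update(data, alpha=2.0, beta=1.0, k=1.0, nu=0.0))
```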

Example 2 - Continued
If we are interested in the posterior distribution of µ alone,

π(µ | x) = ∫ π(τ, µ | x) dτ = ∫ π(µ | τ, x) π(τ | x) dτ.

Here we have a simplification if we assume 2α = m ∈ ℕ. Then τ = W/(2β) and µ = ν + Z/√(kτ), with W ~ χ²_m and Z ~ N(0, 1).

Recalling that Z √(m/W) ~ t_m (Student's t with m d.f.), we see that the prior of µ satisfies

√(km/(2β)) (µ − ν) ~ t_m,

and the posterior satisfies

√(k'm'/(2β')) (µ − ν') ~ t_{m'},   m' = m + n.
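
A minimal Monte Carlo sketch (not from the lecture) checking the prior statement: sampling τ from the Gamma prior and µ | τ from the conditional Normal, the standardised quantity matches the t_m quantiles. The hyperparameters are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta, k, nu = 3.0, 2.0, 5.0, 1.0          # illustrative hyperparameters, m = 2*alpha = 6
m = int(2 * alpha)

tau = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)   # tau ~ Gamma(alpha, rate beta)
mu = nu + rng.normal(size=tau.size) / np.sqrt(k * tau)          # mu | tau ~ N(nu, 1/(k tau))
t_sample = np.sqrt(k * m / (2 * beta)) * (mu - nu)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(t_sample, qs))    # empirical quantiles ...
print(stats.t.ppf(qs, df=m))        # ... close to the t_m quantiles
```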

Example: estimating the probability of female birth given placenta previa
Result of a German study: 980 births, 437 female. In the general population the proportion of female births is 0.485.

Using a uniform (Beta(1, 1)) prior, the posterior is Beta(438, 544):

posterior mean = 0.446
posterior std dev = 0.016
central 95% posterior interval = [0.415, 0.477]

Sensitivity to the proposed prior: for a Beta(α, β) prior, α + β plays the role of a "prior sample size".
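
A minimal sketch (not from the lecture) reproducing these posterior summaries with scipy.

```python
from scipy import stats

posterior = stats.beta(1 + 437, 1 + 980 - 437)            # Beta(438, 544)

print(round(posterior.mean(), 3))                          # 0.446
print(round(posterior.std(), 3))                           # 0.016
print([round(q, 3) for q in posterior.interval(0.95)])     # about [0.415, 0.477]
```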