Part 2: One-parameter models


Bernoulli/binomial models

Return to iid $Y_1,\dots,Y_n \sim \text{Bin}(1,\theta)$. The sampling model/likelihood is
$$p(y_1,\dots,y_n \mid \theta) = \theta^{\sum y_i} (1-\theta)^{n-\sum y_i}$$
When combined with a prior $p(\theta)$, Bayes' rule gives the posterior
$$p(\theta \mid y_1,\dots,y_n) = \frac{\theta^{\sum y_i} (1-\theta)^{n-\sum y_i}\, p(\theta)}{p(y_1,\dots,y_n)}$$
$\sum y_i$ is a sufficient statistic.

The binomial model

When $Y_1,\dots,Y_n \stackrel{iid}{\sim} \text{Bin}(1,\theta)$, the sufficient statistic $Y = \sum_{i=1}^n Y_i$ has a $\text{Bin}(n,\theta)$ distribution.

Having observed $\{Y = y\}$, our task is to obtain the posterior distribution of $\theta$:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{\binom{n}{y} \theta^y (1-\theta)^{n-y}\, p(\theta)}{p(y)} = c(y)\, \theta^y (1-\theta)^{n-y}\, p(\theta)$$
where $c(y)$ is a function of $y$ and not $\theta$.

A uniform prior

The parameter $\theta$ is some unknown number between 0 and 1. Suppose our prior information for $\theta$ is such that all subintervals of $[0,1]$ having the same length also have the same probability:
$$P(a \le \theta \le b) = P(a+c \le \theta \le b+c), \quad \text{for } 0 \le a < b < b+c \le 1$$
This condition implies a uniform density for $\theta$: $p(\theta) = 1$ for all $\theta \in [0,1]$.

Normalizing constant

Using the following result from calculus
$$\int_0^1 \theta^{a-1} (1-\theta)^{b-1}\, d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
where $\Gamma(n) = (n-1)!$ is evaluated in R as gamma(n), we can find out what $c(y)$ is under the uniform prior:
$$\int_0^1 p(\theta \mid y)\, d\theta = 1 \;\Rightarrow\; c(y) = \frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}$$
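This identity is easy to sanity-check numerically. The following is a small illustrative R sketch of mine (the values a = 3 and b = 5 are arbitrary, not from the slides):

    # numerical check of the Beta-function identity used above
    a <- 3; b <- 5                                 # arbitrary illustrative values
    lhs <- integrate(function(th) th^(a - 1) * (1 - th)^(b - 1), 0, 1)$value
    rhs <- gamma(a) * gamma(b) / gamma(a + b)      # gamma(n) evaluates Gamma(n) in R
    c(numerical = lhs, analytical = rhs)           # the two agree to numerical precision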

Beta posterior

So the posterior distribution is
$$p(\theta \mid y) = \frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}\, \theta^y (1-\theta)^{n-y} = \frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}\, \theta^{(y+1)-1} (1-\theta)^{(n-y+1)-1} = \text{Beta}(y+1,\, n-y+1)$$
To wrap up: a uniform prior, plus a Bernoulli/binomial likelihood (sampling model), gives a Beta posterior.

Example: Happiness

Each female of age 65 or over in the 1998 General Social Survey was asked whether or not they were generally happy.

Let $Y_i = 1$ if respondent $i$ reported being generally happy, and $Y_i = 0$ otherwise, for $i = 1,\dots,n = 129$ individuals.

Since $129 \ll N$, the total size of the female senior citizen population, our joint beliefs about $Y_1,\dots,Y_{129}$ are well approximated by

- our beliefs about $\theta = \sum_{i=1}^N Y_i / N$
- the model that, conditional on $\theta$, the $Y_i$'s are IID Bernoulli RVs with expectation $\theta$ (the sampling model)

Example: Happiness, IID Bernoulli likelihood

This sampling model says that the probability for any potential outcome $\{y_1,\dots,y_n\}$, conditional on $\theta$, is given by
$$p(y_1,\dots,y_{129} \mid \theta) = \theta^{\sum_{i=1}^{129} y_i} (1-\theta)^{129 - \sum_{i=1}^{129} y_i}$$
The survey revealed that 118 individuals report being generally happy (91%) and 11 individuals do not (9%). So the probability of these data for a given value of $\theta$ is
$$p(y_1,\dots,y_{129} \mid \theta) = \theta^{118} (1-\theta)^{11}$$

Example: Happiness, binomial likelihood

In the binomial formulation
$$Y = \sum_{i=1}^n Y_i \sim \text{Bin}(n,\theta)$$
and our observed data is $y = 118$. So (now) the probability of these data for a given value of $\theta$ is
$$p(y \mid \theta) = \binom{129}{118} \theta^{118} (1-\theta)^{11}$$

Example: Happiness, posterior

If we take a uniform prior for $\theta$, expressing our ignorance, then we know the posterior is
$$p(\theta \mid y) = \text{Beta}(y+1,\, n-y+1)$$
For $n = 129$ and $y = 118$, this gives $\theta \mid \{Y = 118\} \sim \text{Beta}(119, 12)$.

[Figure: the Beta(119, 12) posterior density and the uniform prior density over theta.]
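The figure can be reproduced with a few lines of R; this is my own minimal sketch, not code from the lecture:

    # posterior Beta(119, 12) versus the flat prior for the happiness example
    n <- 129; y <- 118
    theta <- seq(0, 1, length.out = 500)
    plot(theta, dbeta(theta, y + 1, n - y + 1), type = "l", ylab = "density")  # posterior
    abline(h = 1, lty = 2)                         # uniform prior, for comparison
    (y + 1) / (n + 2)                              # posterior mean, roughly 0.91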

Uniform is Beta

The uniform prior has $p(\theta) = 1$ for all $\theta \in [0,1]$. This can be thought of as a Beta distribution with parameters $a = 1, b = 1$:
$$p(\theta) = \frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}\, \theta^{1-1} (1-\theta)^{1-1} = \frac{1}{1 \cdot 1} \cdot 1 \cdot 1 = 1$$
(under the convention that $\Gamma(1) = 1$), i.e., $\theta \sim \text{Beta}(1,1)$.

We saw that if $Y \sim \text{Bin}(n,\theta)$ then $\theta \mid \{Y = y\} \sim \text{Beta}(1+y,\, 1+n-y)$.

General Beta prior

Suppose $\theta \sim \text{Beta}(a,b)$ and $Y \sim \text{Bin}(n,\theta)$. Then
$$p(\theta \mid y) \propto \theta^{a+y-1} (1-\theta)^{b+n-y-1} \;\Rightarrow\; \theta \mid y \sim \text{Beta}(a+y,\, b+n-y)$$
In other words, the constant of proportionality must be
$$c(n,y,a,b) = \frac{\Gamma(a+b+n)}{\Gamma(a+y)\,\Gamma(b+n-y)}$$
How do I know? $p(\theta \mid y)$ (and the Beta density) must integrate to 1.

Posterior proportionality

We will use this trick over and over again! I.e., if we recognize that the posterior distribution is proportional to a known probability density, then it must be identical to that density.

But careful! The constant of proportionality must be constant with respect to $\theta$.

Conjugacy

We have shown that a Beta prior distribution and a binomial sampling model lead to a Beta posterior distribution. To reflect this, we say that the class of Beta priors is conjugate for the binomial sampling model.

Formally, we say that a class $\mathcal{P}$ of prior distributions for $\theta$ is conjugate for a sampling model $p(y \mid \theta)$ if
$$p(\theta) \in \mathcal{P} \;\Rightarrow\; p(\theta \mid y) \in \mathcal{P}$$
Conjugate priors make posterior calculations easy, but might not actually represent our prior information.

Posterior summaries

If we are interested in point summaries of our posterior inference (not including quantiles, etc.), then the (full) distribution gives us many points to choose from.

E.g., if $\theta \mid \{Y = y\} \sim \text{Beta}(a+y,\, b+n-y)$ then
$$E\{\theta \mid y\} = \frac{a+y}{a+b+n}, \qquad \text{mode}(\theta \mid y) = \frac{a+y-1}{a+b+n-2}, \qquad \text{Var}[\theta \mid y] = \frac{E\{\theta \mid y\}\, E\{1-\theta \mid y\}}{a+b+n+1}$$
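These summaries are straightforward to compute; here is a hedged R sketch (the helper name post_summaries is mine, not from the slides):

    # point summaries of the Beta(a + y, b + n - y) posterior
    post_summaries <- function(a, b, n, y) {
      A <- a + y; B <- b + n - y
      c(mean = A / (A + B),
        mode = (A - 1) / (A + B - 2),              # requires A > 1 and B > 1
        var  = A * B / ((A + B)^2 * (A + B + 1)))
    }
    post_summaries(a = 1, b = 1, n = 129, y = 118) # happiness example, uniform prior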

Combining information

The posterior expectation $E\{\theta \mid y\}$ is easily recognized as a combination of prior and data information:
$$E\{\theta \mid y\} = \frac{a+b}{a+b+n} \times \text{prior expectation} + \frac{n}{a+b+n} \times \text{data average}$$
I.e., for this model and prior distribution, the posterior mean is a weighted average of the prior mean and the sample average, with weights proportional to $a+b$ and $n$ respectively.

Prior sensitivity

This leads to the interpretation of $a$ and $b$ as "prior data":

- $a$: prior number of 1's
- $b$: prior number of 0's
- $a+b$: prior sample size

If $n \gg a+b$, then it seems reasonable that most of our information should be coming from the data as opposed to the prior. Indeed:
$$\frac{a+b}{a+b+n} \approx 0, \qquad E\{\theta \mid y\} \approx \frac{y}{n}, \qquad \text{Var}[\theta \mid y] \approx \frac{1}{n}\,\frac{y}{n}\left(1-\frac{y}{n}\right)$$

Prediction

An important feature of Bayesian inference is the existence of a predictive distribution for new observations.

Revert, for the moment, to our notation for Bernoulli data. Let $y_1,\dots,y_n$ be the outcomes of a sample of $n$ binary RVs, and let $\tilde{Y} \in \{0,1\}$ be an additional outcome from the same population that has yet to be observed.

The predictive distribution of $\tilde{Y}$ is the conditional distribution of $\tilde{Y}$ given $\{Y_1 = y_1,\dots,Y_n = y_n\}$.

Predictive distribution

For conditionally IID binary variables, the predictive distribution can be derived from the distribution of $\tilde{Y}$ given $\theta$ and the posterior distribution of $\theta$:
$$p(\tilde{y} = 1 \mid y_1,\dots,y_n) = E\{\theta \mid y_1,\dots,y_n\} = \frac{a + \sum_{i=1}^n y_i}{a+b+n}$$
$$p(\tilde{y} = 0 \mid y_1,\dots,y_n) = 1 - p(\tilde{y} = 1 \mid y_1,\dots,y_n) = \frac{b + n - \sum_{i=1}^n y_i}{a+b+n}$$
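As a concrete illustration (my own toy numbers, not from the slides), the predictive probability of a 1 is just the posterior mean of theta:

    # predictive probability of a new 1 under a Beta(a, b) prior
    a <- 1; b <- 1                                 # uniform prior
    y_obs <- c(1, 1, 0, 1)                         # toy binary data: 3 ones out of n = 4
    (a + sum(y_obs)) / (a + b + length(y_obs))     # P(ytilde = 1 | data) = 4/6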

Two points about prediction

1. The predictive distribution does not depend upon any unknown quantities. If it did, we would not be able to use it to make predictions.
2. The predictive distribution depends on our observed data, i.e., $\tilde{Y}$ is not independent of $Y_1,\dots,Y_n$. This is because observing $Y_1,\dots,Y_n$ gives information about $\theta$, which in turn gives information about $\tilde{Y}$.

Example: Posterior predictive

Consider a uniform prior, or $\theta \sim \text{Beta}(1,1)$, using $Y = \sum_{i=1}^n y_i$:
$$P(\tilde{Y} = 1 \mid Y = y) = E\{\theta \mid Y = y\} = \frac{2}{2+n} \cdot \frac{1}{2} + \frac{n}{2+n} \cdot \frac{y}{n}$$
but
$$\text{mode}(\theta \mid Y = y) = \frac{y}{n}$$
Does this discrepancy between these two posterior summaries of our information make sense? Consider the case in which $Y = 0$, for which $\text{mode}(\theta \mid Y = 0) = 0$ but $P(\tilde{Y} = 1 \mid Y = 0) = 1/(2+n)$.

Confidence regions

An interval $[l(y), u(y)]$, based on the observed data $Y = y$, has 95% Bayesian coverage for $\theta$ if
$$P(l(y) < \theta < u(y) \mid Y = y) = 0.95$$
The interpretation of this interval is that it describes your information about the true value of $\theta$ after you have observed $Y = y$.

Such intervals are typically called credible intervals, to distinguish them from frequentist confidence intervals, which describe a region (based upon $\hat{\theta}$) wherein the true $\theta_0$ lies 95% of the time. Both, confusingly, use the acronym CI.

Quantile-based (Bayesian) CI

Perhaps the easiest way to obtain a credible interval is to use the posterior quantiles. To make a $100(1-\alpha)\%$ quantile-based CI, find numbers $\theta_{\alpha/2} < \theta_{1-\alpha/2}$ such that

1. $P(\theta < \theta_{\alpha/2} \mid Y = y) = \alpha/2$
2. $P(\theta > \theta_{1-\alpha/2} \mid Y = y) = \alpha/2$

The numbers $\theta_{\alpha/2}$, $\theta_{1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ posterior quantiles of $\theta$.

Example: Binomial sampling and uniform prior

Suppose out of $n = 10$ conditionally independent draws of a binary random variable we observe $Y = 2$ ones. Using a uniform prior distribution for $\theta$, the posterior distribution is $\theta \mid \{Y = 2\} \sim \text{Beta}(1+2,\, 1+8)$.

A 95% CI can be obtained from the 0.025 and 0.975 quantiles of this Beta distribution. These quantiles are 0.06 and 0.52, respectively, and so the posterior probability that $\theta \in [0.06, 0.52]$ is 95%.
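In R the interval comes directly from qbeta; a one-line sketch:

    # 95% quantile-based CI from the Beta(1 + 2, 1 + 8) posterior
    qbeta(c(0.025, 0.975), shape1 = 1 + 2, shape2 = 1 + 8)   # about 0.06 and 0.52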

Example: a quantile-based CI drawback

Notice that there are $\theta$-values outside the quantile-based CI that have higher probability [density] than some points inside the interval.

[Figure: the posterior density over theta with the 95% quantile-based CI marked.]

Example: Estimating the probability of a female birth

The proportion of births that are female has long been a topic of interest both scientifically and to the lay public. E.g., Laplace, perhaps the first Bayesian, analyzed Parisian birth data from 1745 with a uniform prior and a binomial sampling model: 241,945 girls and 251,527 boys.

Example: female birth posterior

The posterior distribution for $\theta$ is ($n = 493472$)
$$\theta \mid \{Y = 241945\} \sim \text{Beta}(241945 + 1,\, 251527 + 1)$$
Now, we can use the posterior CDF to calculate
$$P(\theta > 0.5 \mid Y = 241945) = \texttt{pbeta(0.5, 241945+1, 251527+1, lower.tail=FALSE)} \approx 1.15 \times 10^{-42}$$

The Poisson model

Some measurements, such as a person's number of children or number of friends, have values that are whole numbers. In these cases the sample space is $\mathcal{Y} = \{0, 1, 2, \dots\}$. Perhaps the simplest probability model on $\mathcal{Y}$ is the Poisson model.

A RV $Y$ has a Poisson distribution with mean $\theta$ if
$$P(Y = y \mid \theta) = \frac{\theta^y e^{-\theta}}{y!} \quad \text{for } y \in \{0, 1, 2, \dots\}$$

IID sampling model

If we take $Y_1,\dots,Y_n \stackrel{iid}{\sim} \text{Pois}(\theta)$, then the joint PMF is:
$$p(y_1,\dots,y_n \mid \theta) = \prod_{i=1}^n \frac{\theta^{y_i} e^{-\theta}}{y_i!} = \frac{\theta^{\sum y_i}\, e^{-n\theta}}{\prod_{i=1}^n y_i!}$$
Note that $\sum_{i=1}^n Y_i \sim \text{Pois}(n\theta)$.

Conjugate prior

Recall that a class of priors is conjugate for a sampling model if the posterior is also in that class. For the Poisson model, our posterior distribution for $\theta$ has the following form:
$$p(\theta \mid y_1,\dots,y_n) \propto p(y_1,\dots,y_n \mid \theta)\, p(\theta) \propto \theta^{\sum y_i} e^{-n\theta}\, p(\theta)$$
This means that whatever our conjugate class of densities is, it will have to include terms like $\theta^{c_1} e^{-c_2 \theta}$ for numbers $c_1$ and $c_2$.

Conjugate gamma prior

The simplest class of probability distributions matching this description is the gamma family:
$$p(\theta) = \frac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta} \quad \text{for } \theta, a, b > 0$$
Some properties include
$$E\{\theta\} = \frac{a}{b}, \qquad \text{Var}[\theta] = \frac{a}{b^2}, \qquad \text{mode}(\theta) = \begin{cases} (a-1)/b & \text{if } a > 1 \\ 0 & \text{if } a \le 1 \end{cases}$$
We shall denote this family as $\text{G}(a, b)$.

Posterior distribution

Suppose $Y_1,\dots,Y_n \stackrel{iid}{\sim} \text{Pois}(\theta)$ and $\theta \sim \text{G}(a, b)$. Then
$$p(\theta \mid y_1,\dots,y_n) \propto \theta^{a + \sum y_i - 1}\, e^{-(b+n)\theta}$$
This is evidently a gamma distribution, and we have confirmed the conjugacy of the gamma family for the Poisson sampling model:
$$\left.\begin{array}{l} \theta \sim \text{G}(a, b) \\ Y_1,\dots,Y_n \sim \text{Pois}(\theta) \end{array}\right\} \;\Rightarrow\; \theta \mid \{Y_1,\dots,Y_n\} \sim \text{G}\!\left(a + \sum_{i=1}^n Y_i,\; b + n\right)$$
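A minimal R sketch of this conjugate update, using a toy prior and toy counts of my own choosing:

    # gamma-Poisson conjugate update
    a <- 2; b <- 1                                 # assumed G(2, 1) prior
    y <- c(3, 1, 0, 2, 4)                          # toy Poisson counts
    a_post <- a + sum(y); b_post <- b + length(y)  # posterior is G(a + sum(y), b + n)
    c(post_mean = a_post / b_post, post_var = a_post / b_post^2)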

Combining information

Estimation and prediction proceed in a manner similar to that in the binomial model. The posterior expectation of $\theta$ is a convex combination of the prior expectation and the sample average:
$$E\{\theta \mid y_1,\dots,y_n\} = \frac{b}{b+n} \cdot \frac{a}{b} + \frac{n}{b+n} \cdot \frac{\sum y_i}{n}$$

- $b$ is the number of prior observations
- $a$ is the sum of the counts from $b$ prior observations

For large $n$, the information in the data dominates:
$$n \gg b \;\Rightarrow\; E\{\theta \mid y_1,\dots,y_n\} \approx \bar{y}, \qquad \text{Var}[\theta \mid y_1,\dots,y_n] \approx \frac{\bar{y}}{n}$$

Posterior predictive

Predictions about additional data can be obtained with the posterior predictive distribution
$$p(\tilde{y} \mid y_1,\dots,y_n) = \int_0^\infty p(\tilde{y} \mid \theta)\, p(\theta \mid y_1,\dots,y_n)\, d\theta \quad \text{for } \tilde{y} \in \{0, 1, 2, \dots\}$$

Posterior predictive

This is a negative binomial distribution with parameters $(a + \sum y_i,\; b + n)$, for which
$$E\{\tilde{Y} \mid y_1,\dots,y_n\} = \frac{a + \sum y_i}{b+n} = E\{\theta \mid y_1,\dots,y_n\}$$
$$\text{Var}[\tilde{Y} \mid y_1,\dots,y_n] = \frac{a + \sum y_i}{b+n} \cdot \frac{b+n+1}{b+n} = \text{Var}[\theta \mid y_1,\dots,y_n]\,(b+n+1) = E\{\theta \mid y_1,\dots,y_n\} \cdot \frac{b+n+1}{b+n}$$
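In R's (size, prob) parameterization of the negative binomial, this predictive corresponds to size = a + sum(y) and prob = (b + n)/(b + n + 1); a small sketch of mine, continuing the toy setup above:

    # posterior predictive via dnbinom, continuing the toy gamma-Poisson example
    a <- 2; b <- 1; y <- c(3, 1, 0, 2, 4)
    size <- a + sum(y); prob <- (b + length(y)) / (b + length(y) + 1)
    dnbinom(0:5, size = size, prob = prob)         # P(ytilde = 0), ..., P(ytilde = 5)
    size * (1 - prob) / prob                       # predictive mean = (a + sum(y)) / (b + n)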

Example: birth rates and education

Over the course of the 1990s the General Social Survey gathered data on the educational attainment and number of children of 155 women who were 40 years of age at the time of their participation in the survey. These women were in their 20s in the 1970s, a period of historically low fertility rates in the United States.

In this example we will compare the women with college degrees to those without in terms of their numbers of children.

Example: sampling model(s)

Let $Y_{1,1},\dots,Y_{n_1,1}$ denote the numbers of children for the $n_1$ women without college degrees, and $Y_{1,2},\dots,Y_{n_2,2}$ be the data for the $n_2$ women with degrees. For this example, we will use the following sampling models:
$$Y_{1,1},\dots,Y_{n_1,1} \stackrel{iid}{\sim} \text{Pois}(\theta_1), \qquad Y_{1,2},\dots,Y_{n_2,2} \stackrel{iid}{\sim} \text{Pois}(\theta_2)$$
The (sufficient) data are

- no bachelor's: $n_1 = 111$, $\sum_{i=1}^{n_1} y_{i,1} = 217$, $\bar{y}_1 = 1.95$
- bachelor's: $n_2 = 44$, $\sum_{i=1}^{n_2} y_{i,2} = 66$, $\bar{y}_2 = 1.50$

Example: prior(s) and posterior(s)

In the case where $\theta_1, \theta_2 \sim \text{G}(2, 1)$, we have the following posterior distributions:
$$\theta_1 \mid \{n_1 = 111, \textstyle\sum Y_{i,1} = 217\} \sim \text{G}(2 + 217,\; 1 + 111)$$
$$\theta_2 \mid \{n_2 = 44, \textstyle\sum Y_{i,2} = 66\} \sim \text{G}(2 + 66,\; 1 + 44)$$
Posterior means, modes, and 95% quantile-based CIs for $\theta_1$ and $\theta_2$ can be obtained from their gamma posterior distributions.
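For these data the summaries come from the gamma mean formula and qgamma; a hedged R sketch, not the slides' own code:

    # posterior summaries for theta1 (no bachelor's) and theta2 (bachelor's)
    a <- 2; b <- 1                                            # G(2, 1) prior
    c(mean1 = (a + 217) / (b + 111), mean2 = (a + 66) / (b + 44))
    qgamma(c(0.025, 0.975), shape = a + 217, rate = b + 111)  # 95% CI for theta1
    qgamma(c(0.025, 0.975), shape = a + 66,  rate = b + 44)   # 95% CI for theta2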

Example: prior(s) and posterior(s)

[Figure: the G(2, 1) prior density and the two posterior densities (no bachelor's, bachelor's) over theta.]

The posterior(s) provide substantial evidence that $\theta_1 > \theta_2$. E.g.,
$$P(\theta_1 > \theta_2 \mid \textstyle\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = 0.97$$
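This probability is easy to approximate by Monte Carlo; the sketch below is my own approach, not necessarily how the slides computed it:

    # Monte Carlo estimate of P(theta1 > theta2 | data)
    set.seed(1)
    theta1 <- rgamma(100000, shape = 2 + 217, rate = 1 + 111)
    theta2 <- rgamma(100000, shape = 2 + 66,  rate = 1 + 44)
    mean(theta1 > theta2)                          # approximately 0.97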

Example: posterior predictive(s)

Consider a randomly sampled individual from each population. To what extent do we expect the uneducated one to have more children than the other?

[Figure: posterior predictive probabilities of 0 through 10 children for the two groups (none, bachelor's).]

$$P(\tilde{Y}_1 > \tilde{Y}_2 \mid \textstyle\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = 0.48$$
$$P(\tilde{Y}_1 = \tilde{Y}_2 \mid \textstyle\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = 0.22$$

The distinction between $\{\theta_1 > \theta_2\}$ and $\{\tilde{Y}_1 > \tilde{Y}_2\}$ is important: strong evidence of a difference between two populations does not mean the difference is large.
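The predictive comparison can be approximated the same way, by sampling from the two negative binomial predictives; again, a sketch of mine rather than slide code:

    # Monte Carlo estimate of the posterior predictive comparison
    set.seed(1)
    y1 <- rnbinom(100000, size = 2 + 217, prob = (1 + 111) / (1 + 111 + 1))
    y2 <- rnbinom(100000, size = 2 + 66,  prob = (1 + 44)  / (1 + 44  + 1))
    mean(y1 > y2)                                  # approximately 0.48
    mean(y1 == y2)                                 # approximately 0.22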

Non-informative priors

Priors can be difficult to construct. There has long been a desire for prior distributions that can be guaranteed to play a minimal role in the posterior distribution. Such distributions are sometimes called reference priors, and described as vague, flat, diffuse, or non-informative.

The rationale for using non-informative priors is often said to be "to let the data speak for themselves," so that inferences are unaffected by information external to the current data.

Jeffreys' invariance principle

One approach that is sometimes used to define a non-informative prior distribution was introduced by Jeffreys (1946), based on considering one-to-one transformations of the parameter. Jeffreys' general principle is that any rule for determining the prior density should yield an equivalent posterior if applied to the transformed parameter.

Naturally, one must take the sampling model into account in order to study this invariance.

The Jeffreys prior

Jeffreys' principle leads to defining the non-informative prior as $p(\theta) \propto [i(\theta)]^{1/2}$, where $i(\theta)$ is the Fisher information for $\theta$. Recall that
$$i(\theta) = E\left\{\left(\frac{d}{d\theta} \log p(Y \mid \theta)\right)^2 \,\middle|\, \theta\right\} = -E\left\{\frac{d^2}{d\theta^2} \log p(Y \mid \theta) \,\middle|\, \theta\right\}$$
A prior so constructed is called the Jeffreys prior for the sampling model $p(y \mid \theta)$.

Example: Jeffreys prior for the binomial sampling model

Consider the binomial distribution $Y \sim \text{Bin}(n, \theta)$, which has log-likelihood
$$\log p(y \mid \theta) = c + y \log\theta + (n-y)\log(1-\theta)$$
Therefore
$$i(\theta) = -E\left\{\frac{d^2}{d\theta^2} \log p(Y \mid \theta) \,\middle|\, \theta\right\} = \frac{n}{\theta(1-\theta)}$$
So the Jeffreys prior density is
$$p(\theta) \propto \theta^{-1/2} (1-\theta)^{-1/2}$$
which is the same as $\text{Beta}(\tfrac12, \tfrac12)$.

Example: non-informative?

This Beta(1/2, 1/2) Jeffreys prior is non-informative in some sense, but not in all senses. The posterior expectation would give $a + b = 1$ sample's worth of weight to the prior mean. This Jeffreys prior is less informative than the uniform prior, which gives a weight of 2 samples to the prior mean (also 1/2).

In this sense, the least informative prior is Beta(0, 0). But is this a valid prior distribution?

Propriety of priors

We call a prior density $p(\theta)$ proper if it does not depend on the data and is integrable, i.e.,
$$\int p(\theta)\, d\theta < \infty$$
Proper priors lead to valid joint probability models $p(\theta, y)$ and proper posteriors ($\int p(\theta \mid y)\, d\theta < \infty$).

Improper priors do not provide valid joint probability models, but they can lead to proper posteriors by proceeding with the Bayesian algebra $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$.

Example: continued

The Beta(0, 0) prior is improper since
$$\int_0^1 \theta^{-1} (1-\theta)^{-1}\, d\theta = \infty$$
However, the Beta($0+y$, $0+n-y$) posterior that results is proper as long as $y \neq 0, n$.

A Beta($\epsilon, \epsilon$) prior, for $0 < \epsilon \ll 1$, is always proper and (in the limit) non-informative. In practice, the difference between these alternatives (Beta(0, 0), Beta(1/2, 1/2), Beta(1, 1)) is small; all three allow the data (likelihood) to dominate in the posterior.

Example: Jeffreys prior for the Poisson sampling model

Consider the Poisson distribution $Y \sim \text{Pois}(\theta)$, which has log-likelihood
$$\log p(y \mid \theta) = c + y \log\theta - \theta$$
Therefore
$$i(\theta) = -E\left\{\frac{d^2}{d\theta^2} \log p(Y \mid \theta) \,\middle|\, \theta\right\} = \frac{1}{\theta}$$
So the Jeffreys prior density is
$$p(\theta) \propto \theta^{-1/2}$$

Example: improper Jeffreys prior

Since
$$\int_0^\infty \theta^{-1/2}\, d\theta = \infty$$
this prior is improper. However, we can interpret it as a $\text{G}(\tfrac12, 0)$, i.e., within the conjugate gamma family, to see that the posterior is
$$\text{G}\!\left(\frac12 + \sum_{i=1}^n y_i,\; 0 + n\right)$$
This is proper as long as we have observed at least one data point, i.e., $n \ge 1$, since then the shape $\tfrac12 + \sum y_i > 0$ and the rate $n > 0$.

Example: non-informative?

Again, this G(1/2, 0) Jeffreys prior is non-informative in some sense, but not in all senses. The (improper) prior $p(\theta) \propto 1/\theta$, or G(0, 0), is in some sense equivalent, since it gives the same weight ($b = 0$) to the prior mean (although that mean differs).

It can be shown that, in the limit as the prior parameters $(a, b) \to (0, 0)$, the posterior expectation approaches the MLE; $p(\theta) \propto 1/\theta$ is a typical default for this reason. As before, the practical difference between these choices is small.

Notes of caution

The search for non-informative priors has several problems, including:

- it can be misguided: if the likelihood (data) is truly dominant, the choice among relatively flat priors cannot matter
- for many problems there is no clear choice

Bottom line: if so few data are available that the choice of non-informative prior matters, then one should put relevant information into the prior.