INTRODUCTION TO BAYESIAN ANALYSIS


Arto Luoma
University of Tampere, Finland
Autumn 2014

Who was Thomas Bayes?

Thomas Bayes (c. 1701-1761) was an English philosopher and Presbyterian minister. In his later years he took a deep interest in probability. He suggested a solution to a problem of inverse probability: what do we know about the probability of success if the number of successes is recorded in a binomial experiment? Richard Price discovered Bayes' essay and published it posthumously. He believed that Bayes' theorem helped prove the existence of God.

Bayesian paradigm

Bayesian paradigm: posterior information = prior information + data information

More formally:

p(θ | y) ∝ p(θ) p(y | θ),

where ∝ is the symbol for proportionality, θ is an unknown parameter, y is the data, and p(θ), p(θ | y) and p(y | θ) are the density functions of the prior, posterior and sampling distributions, respectively.

In Bayesian inference the unknown parameter θ is considered stochastic, unlike in classical inference. The distributions p(θ) and p(θ | y) express uncertainty about the exact value of θ. The density of the data, p(y | θ), provides the information from the data. It is called the likelihood function when considered as a function of θ.
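As a small sketch of this proportionality (the values n = 100, y = 20 and the Unif(0.05, 0.15) prior are the ones used in Example 1 below; the grid-approximation code itself is an illustration, not part of the lecture material), the posterior of a binomial success probability can be approximated in R by multiplying a prior density by the likelihood and renormalizing:

# Grid approximation of p(theta | y) ∝ p(theta) p(y | theta)
# for binomial data (values n = 100, y = 20 as in Example 1)
theta <- seq(0, 1, length.out = 1001)          # grid of parameter values
prior <- dunif(theta, 0.05, 0.15)              # prior density p(theta)
lik   <- dbinom(20, size = 100, prob = theta)  # likelihood p(y | theta)
post  <- prior * lik                           # unnormalized posterior
post  <- post / (sum(post) * 0.001)            # renormalize (grid spacing 0.001)
plot(theta, post, type = "l", xlab = expression(theta), ylab = "posterior density")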

Software for Bayesian Statistics

In this course we use the R and BUGS programming languages. BUGS stands for Bayesian inference Using Gibbs Sampling. Gibbs sampling was the computational technique first adopted for Bayesian analysis. The goal of the BUGS project is to separate the knowledge base from the inference machine used to draw conclusions. The BUGS language is able to describe complex models using a very limited syntax.

There are three widely used BUGS implementations: WinBUGS, OpenBUGS and JAGS. Both WinBUGS and OpenBUGS have a Windows GUI. Further, each engine can be controlled from R. In this course we introduce rjags, the R interface to JAGS.

Contents of the course

Bayes' theorem
Prior and posterior distributions (Examples 1 and 2)
Decision theory
Bayes estimators (Examples 1 and 2)
Conjugate priors
Noninformative priors
Intervals
Prediction

Bayes' theorem

Let A_1, A_2, ..., A_k be events that partition the sample space Ω (i.e. Ω = A_1 ∪ A_2 ∪ ... ∪ A_k and A_i ∩ A_j = ∅ when i ≠ j), and let B be an event on that space for which Pr(B) > 0. Then Bayes' theorem is

Pr(A_j | B) = Pr(A_j) Pr(B | A_j) / Σ_{i=1}^k Pr(A_i) Pr(B | A_i).

This formula can be used to reverse conditional probabilities. If one knows the probabilities of the events A_j and the conditional probabilities Pr(B | A_j), j = 1, ..., k, the formula can be used to compute the conditional probabilities Pr(A_j | B).
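As a small numeric illustration (the three-event partition and the probabilities below are hypothetical, chosen only for the sketch), Bayes' theorem can be evaluated directly in R:

# Reversing conditional probabilities with Bayes' theorem
# (hypothetical partition A1, A2, A3 and event B)
prior     <- c(0.5, 0.3, 0.2)    # Pr(A_j), must sum to 1
likeB     <- c(0.1, 0.4, 0.8)    # Pr(B | A_j)
posterior <- prior * likeB / sum(prior * likeB)
posterior                        # Pr(A_j | B), sums to 1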

Example (Diagnostic tests)

A disease occurs with prevalence γ in the population, and θ indicates whether an individual has the disease. Hence Pr(θ = 1) = γ, Pr(θ = 0) = 1 - γ. A diagnostic test gives a result Y, whose distribution function is F_1(y) for a diseased individual and F_0(y) otherwise. The most common type of test declares that a person is diseased if Y > y_0, where y_0 is fixed on the basis of past data.

The probability that a person is diseased, given a positive test result, is

Pr(θ = 1 | Y > y_0) = γ[1 - F_1(y_0)] / {γ[1 - F_1(y_0)] + (1 - γ)[1 - F_0(y_0)]}.

This is sometimes called the positive predictive value of the test. Its sensitivity and specificity are 1 - F_1(y_0) and F_0(y_0). (Example from Davison, 2003.)
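The positive predictive value can be computed directly from the prevalence, sensitivity and specificity. A sketch with hypothetical numbers, chosen to show how a rare disease gives a low predictive value even for an accurate test:

# Positive predictive value Pr(diseased | positive test)
# (hypothetical prevalence, sensitivity and specificity)
ppv <- function(prev, sens, spec) {
  prev * sens / (prev * sens + (1 - prev) * (1 - spec))
}
ppv(prev = 0.01, sens = 0.95, spec = 0.90)   # about 0.088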

Prior and posterior distributions

In a more general case, θ can take a finite number of values, labelled 1, ..., k. We can assign to these values probabilities p_1, ..., p_k which express our beliefs about θ before we have access to the data. The data y are assumed to be the observed value of a (multidimensional) random variable Y, and p(y | θ) is the density of y given θ (the likelihood function).

Then the conditional probabilities

Pr(θ = j | Y = y) = p_j p(y | θ = j) / Σ_{i=1}^k p_i p(y | θ = i),   j = 1, ..., k,

summarize our beliefs about θ after we have observed Y. The unconditional probabilities p_1, ..., p_k are called prior probabilities, and Pr(θ = 1 | Y = y), ..., Pr(θ = k | Y = y) are called posterior probabilities of θ.

Prior and posterior distributions (2)

When θ can take values continuously on some interval, we can express our beliefs about it with a prior density p(θ). After we have obtained the data y, our beliefs about θ are contained in the conditional density

p(θ | y) = p(θ) p(y | θ) / ∫ p(θ) p(y | θ) dθ,     (1)

called the posterior density. Since θ is integrated out in the denominator, the denominator can be treated as a constant with respect to θ. Therefore, the Bayes formula in (1) is often written as

p(θ | y) ∝ p(θ) p(y | θ),     (2)

which denotes that p(θ | y) is proportional to p(θ) p(y | θ).

Example 1 (Introducing a New Drug in the Market)

A drug company would like to introduce a drug to reduce acid indigestion. It is desirable to estimate θ, the proportion of the market share that this drug will capture. The company interviews n people, and Y of them say that they will buy the drug. In the non-Bayesian analysis θ ∈ [0, 1] and Y ~ Bin(n, θ).

We know that θ^ = Y/n is a very good estimator of θ. It is unbiased, consistent and minimum variance unbiased. Moreover, it is also the maximum likelihood estimator (MLE), and thus asymptotically normal.

A Bayesian may look at the past performance of new drugs of this type. If in the past new drugs have tended to capture a proportion between, say, 0.05 and 0.15 of the market, and if all values in between are assumed equally likely, then θ ~ Unif(0.05, 0.15). (Example from Rohatgi, 2003.)

Example 1 (continued)

Thus, the prior is given by

p(θ) = 1/(0.15 - 0.05) = 10,  when 0.05 ≤ θ ≤ 0.15,
p(θ) = 0,                     otherwise,

and the likelihood function by

p(y | θ) = (n choose y) θ^y (1 - θ)^(n-y).

The posterior is

p(θ | y) = p(θ) p(y | θ) / ∫ p(θ) p(y | θ) dθ
         = θ^y (1 - θ)^(n-y) / ∫_{0.05}^{0.15} θ^y (1 - θ)^(n-y) dθ,  when 0.05 ≤ θ ≤ 0.15,

and 0 otherwise.

Example 1 (continued)

Suppose that the sample size is n = 100 and y = 20 say that they will use the drug. Then the following BUGS code can be used to simulate the posterior:

model{
  theta ~ dunif(0.05,0.15)
  y ~ dbin(theta,n)
}

Suppose that this is the contents of the file Acid.txt in the home directory. Then JAGS can be called from R as follows:

library(rjags)
acid <- list(n=100,y=20)
acid.jag <- jags.model("Acid.txt",acid)
acid.coda <- coda.samples(acid.jag,"theta",10000)
hist(acid.coda[[1]][,"theta"],main="",xlab=expression(theta))

Example 1 (continued)

Figure 1: Market share of a new drug: simulations from the posterior distribution of θ (histogram of the simulated values of θ).

Example 2 (Diseased White Pine Trees)

White pine is one of the best known species of pines in the northeastern United States and Canada. White pine is susceptible to blister rust, which develops cankers on the bark. These cankers swell, resulting in the death of twigs and small trees. A forester wishes to estimate the average number of diseased pine trees per acre in a forest.

The number of diseased trees per acre can be modeled by a Poisson distribution with mean θ. Since θ changes from area to area, the forester believes that θ ~ Exp(λ). Thus,

p(θ) = (1/λ) e^(-θ/λ), if θ > 0, and 0 elsewhere.

The forester takes a random sample of size n from n different one-acre plots. (Example from Rohatgi, 2003.)

Example 2 (continued)

The likelihood function is

p(y | θ) = Π_{i=1}^n [θ^{y_i} e^{-θ} / y_i!] ∝ θ^{Σ_{i=1}^n y_i} e^{-nθ}.

Consequently, the posterior is

p(θ | y) = θ^{Σ y_i} e^{-θ(n + 1/λ)} / ∫_0^∞ θ^{Σ y_i} e^{-θ(n + 1/λ)} dθ.

We see that this is a Gamma distribution with parameters α = Σ_{i=1}^n y_i + 1 and β = n + 1/λ. Thus,

p(θ | y) = [(n + 1/λ)^{Σ y_i + 1} / Γ(Σ y_i + 1)] θ^{Σ y_i} e^{-θ(n + 1/λ)}.
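With concrete numbers the posterior can be evaluated in closed form. The counts below and the prior mean λ = 5 are hypothetical, used only as a sketch of the Gamma(Σ y_i + 1, n + 1/λ) result:

# Poisson likelihood with Exp(lambda) prior: posterior is
# Gamma(shape = sum(y) + 1, rate = n + 1/lambda)
# (hypothetical counts of diseased trees on n = 8 one-acre plots)
y      <- c(3, 0, 2, 5, 1, 4, 2, 3)
lambda <- 5                          # prior mean of theta
shape  <- sum(y) + 1                 # = 21
rate   <- length(y) + 1/lambda       # = 8.2
shape / rate                         # posterior mean of theta
curve(dgamma(x, shape, rate), from = 0, to = 8,
      xlab = expression(theta), ylab = "posterior density")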

Statistical decision theory

The outcome of a Bayesian analysis is the posterior distribution, which combines the prior information and the information from the data. However, sometimes we may want to summarize the posterior information with a scalar, for example the mean, median or mode of the posterior. In the following, we show how the use of a scalar estimator can be justified using statistical decision theory.

Let L(θ, θ^) denote the loss function which gives the cost of using θ^ = θ^(y) as an estimate for θ. We say that θ^ is a Bayes estimate of θ if it minimizes the posterior expected loss

E[L(θ, θ^) | y] = ∫ L(θ, θ^) p(θ | y) dθ.

Statistical decision theory (continued)

On the other hand, the expectation of the loss function over the sampling distribution of y is called the risk function:

R_θ^(θ) = E[L(θ, θ^) | θ] = ∫ L(θ, θ^) p(y | θ) dy.

Further, the expectation of the risk function over the prior distribution of θ,

E[R_θ^(θ)] = ∫ R_θ^(θ) p(θ) dθ,

is called the Bayes risk.

Statistical decision theory (continued)

By changing the order of integration one can see that the Bayes risk

∫ R_θ^(θ) p(θ) dθ = ∫ p(θ) ∫ L(θ, θ^) p(y | θ) dy dθ = ∫ p(y) ∫ L(θ, θ^) p(θ | y) dθ dy     (3)

is minimized when the inner integral in (3) is minimized for each y, that is, when a Bayes estimator is used.

In the following, we introduce the Bayes estimators for three simple loss functions.

Bayes estimators: zero-one loss function

Zero-one loss:

L(θ, θ^) = 0 when |θ^ - θ| < a,
L(θ, θ^) = 1 when |θ^ - θ| ≥ a.

We should minimize

∫ L(θ, θ^) p(θ | y) dθ = ∫_{-∞}^{θ^-a} p(θ | y) dθ + ∫_{θ^+a}^{∞} p(θ | y) dθ = 1 - ∫_{θ^-a}^{θ^+a} p(θ | y) dθ,

or, equivalently, maximize

∫_{θ^-a}^{θ^+a} p(θ | y) dθ.

Bayes estimators: absolute error loss and quadratic loss function

If p(θ | y) is unimodal, the maximization is achieved by choosing θ^ to be the midpoint of the interval of length 2a for which p(θ | y) has the same value at both ends. If we let a → 0, then θ^ tends to the mode of the posterior. This equals the MLE if p(θ) is flat.

Absolute error loss: L(θ, θ^) = |θ^ - θ|. In general, if X is a random variable, then the expectation E(|X - d|) is minimized by choosing d to be the median of the distribution of X. Thus, the Bayes estimate of θ is the posterior median.

Quadratic loss function: L(θ, θ^) = (θ^ - θ)^2. In general, if X is a random variable, then the expectation E[(X - d)^2] is minimized by choosing d to be the mean of the distribution of X. Thus, the Bayes estimate of θ is the posterior mean.
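These facts are easy to check by Monte Carlo. The sketch below (my own illustration, not part of the lecture code) draws from the Beta(21, 81) posterior that arises in Example 1 with a flat prior and compares the expected losses at the posterior median, mean and mode; because this posterior is nearly symmetric, the differences are small but the ranking holds:

# Monte Carlo check: the posterior median minimizes expected absolute loss,
# the posterior mean minimizes expected quadratic loss
# (Beta(21, 81) posterior, as in Example 1 with a flat Beta(1, 1) prior)
set.seed(1)
theta <- rbeta(100000, 21, 81)
cand  <- c(median = median(theta), mean = mean(theta), mode = 20/100)
sapply(cand, function(d) mean(abs(theta - d)))   # smallest at the median
sapply(cand, function(d) mean((theta - d)^2))    # smallest at the mean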

Bayes estimators: Example 1 (continued)

We continue our example of the market share of a new drug. Using R, we can compute the posterior mean and median estimates, and various posterior intervals:

summary(acid.coda)

The output lists (1) the empirical mean and standard deviation of each variable, together with the naive and time-series standard errors of the mean, and (2) the 2.5%, 25%, 50%, 75% and 97.5% quantiles of each variable.

Bayes estimators: Example 1 (continued)

From Figure 1 we see that the posterior mode is approximately 0.15, the upper bound of the prior support.

If we use Beta(α, β), whose density is

p(θ) = θ^(α-1) (1 - θ)^(β-1) / B(α, β),  when 0 < θ < 1,

as a prior, then the posterior is

p(θ | y) ∝ p(θ) p(y | θ) ∝ θ^(α+y-1) (1 - θ)^(β+n-y-1).

We see immediately that the posterior is Beta(α + y, β + n - y). The posterior mean (the Bayes estimator under quadratic loss) is (α + y)/(α + β + n). The mode (the Bayes estimator under zero-one loss when a → 0) is (α + y - 1)/(α + β + n - 2), provided that the distribution is unimodal.
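These closed-form estimates are easy to evaluate. A sketch with the data of Example 1 (n = 100, y = 20) and an assumed flat Beta(1, 1) prior:

# Posterior mean, mode and median for a Beta(alpha + y, beta + n - y) posterior
# (data from Example 1; the flat Beta(1, 1) prior is an assumption)
alpha <- 1; beta <- 1; n <- 100; y <- 20
a <- alpha + y; b <- beta + n - y   # posterior parameters: Beta(21, 81)
a / (a + b)                         # posterior mean   = 21/102 ≈ 0.206
(a - 1) / (a + b - 2)               # posterior mode   = 20/100 = 0.200
qbeta(0.5, a, b)                    # posterior median ≈ 0.204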

Bayes estimators: Example 2 (continued)

We now continue our example of estimating the average number of diseased trees per acre. We derived that the posterior is Gamma(Σ_{i=1}^n y_i + 1, n + 1/λ). Thus, the Bayes estimator with a quadratic loss function is the mean of this distribution, (Σ_{i=1}^n y_i + 1)/(n + 1/λ). The posterior median (the Bayes estimator under absolute error loss), by contrast, is not available in closed form.

Note that the classical estimate for θ is the sample mean ȳ.

Conjugate priors

Computations can often be facilitated by using conjugate prior distributions. We say that a prior is conjugate for the likelihood if the prior and posterior distributions belong to the same family. There are conjugate distributions for the exponential family of sampling distributions.

Conjugate priors can be formed with the following simple steps:
1. Write the likelihood function.
2. Remove the factors that do not depend on θ.
3. Replace the expressions which depend on the data with parameters. Also the sample size n should be replaced.
4. Now you have the kernel of the conjugate prior. You can complement it with the normalizing constant.
5. In order to obtain the standard parametrization it may be necessary to reparametrize.

Example: Poisson likelihood

Let y = (y_1, ..., y_n) be a sample from Poi(θ). Then the likelihood is

p(y | θ) = Π_{i=1}^n [θ^{y_i} e^{-θ} / y_i!] ∝ θ^{Σ_{i=1}^n y_i} e^{-nθ}.

By replacing Σ y_i and n, which depend on the data, with the parameters α_1 and α_2, we obtain the conjugate prior

p(θ) ∝ θ^{α_1} e^{-α_2 θ},

which is Gamma(α_1 + 1, α_2). If we reparametrize this so that α = α_1 + 1 and β = α_2, we obtain the prior Gamma(α, β).

Example: Uniform likelihood

Assume that y = (y_1, ..., y_n) is a random sample from Unif(0, θ). Then the density of a single observation y_i is

p(y_i | θ) = 1/θ,  when 0 ≤ y_i ≤ θ,  and 0 otherwise,

and the likelihood of θ is

p(y | θ) = 1/θ^n,  when 0 ≤ y_(1) ≤ ... ≤ y_(n) ≤ θ,  and 0 otherwise,
         = (1/θ^n) I_{y_(n) ≤ θ}(y) I_{y_(1) ≥ 0}(y),

where I_A(y) denotes an indicator function taking the value 1 when y ∈ A and 0 otherwise.

Example: Uniform likelihood (continued)

Now, by removing the factor I_{y_(1) ≥ 0}(y), which does not depend on θ, and replacing n and y_(n) with parameters, we obtain

p(θ) ∝ (1/θ^α) I_{θ ≥ β}(θ) = 1/θ^α,  when θ ≥ β,  and 0 otherwise.

This is the kernel of the Pareto distribution. The posterior

p(θ | y) ∝ p(θ) p(y | θ) ∝ 1/θ^{n+α},  when θ ≥ max(β, y_(n)),  and 0 otherwise,

is also a Pareto distribution.
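A short sketch of how this posterior can be used in practice (the data and the prior parameters α = 3, β = 1 below are hypothetical). With the kernel 1/θ^(n+α) on [m, ∞), where m = max(β, y_(n)), the posterior is a Pareto distribution with shape k = n + α - 1 and scale m, so it can be sampled by inverting its distribution function F(θ) = 1 - (m/θ)^k:

# Pareto posterior for Unif(0, theta) data (hypothetical data and prior)
set.seed(1)
y     <- c(2.1, 0.7, 3.4, 1.9, 2.8)   # observed sample
alpha <- 3; beta <- 1                  # prior p(theta) ∝ 1/theta^alpha, theta >= beta
m <- max(beta, max(y))                 # posterior scale: max(beta, y_(n)) = 3.4
k <- length(y) + alpha - 1             # posterior shape: n + alpha - 1 = 7
u <- runif(10000)
theta <- m * (1 - u)^(-1/k)            # inverse-CDF draws from the Pareto posterior
mean(theta)                            # compare with the analytic mean below
k * m / (k - 1)                        # analytic posterior mean ≈ 3.97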

Noninformative priors

When there is no prior information available on the estimated parameters, noninformative priors can be used. They can also be used to find out how an informative prior affects the outcome of the inference.

The uniform distribution p(θ) ∝ 1 is often used as a noninformative prior. However, this is not fully unproblematic. If the uniform distribution is restricted to an interval, it is not, in fact, noninformative. For example, the prior Unif(0, 1) contains the information that θ is in the interval [0.2, 0.4] with probability 0.2. This information content becomes obvious when a parametric transformation is made: the distribution of the transformed parameter is no longer uniform.

Noninformative priors (continued)

Another problem arises if the parameter can take values in an infinite interval. In such a case there is no proper uniform distribution. However, one can use an improper uniform prior. Then the posterior is proportional to the likelihood.

Some parameters, for example scale parameters and variances, can take only positive values. Such parameters are often given the improper prior p(θ) ∝ 1/θ, which implies that log(θ) has a uniform prior.

Jeffreys suggested giving a uniform prior to a transformation of θ whose Fisher information is constant. Jeffreys' prior is defined as

p(θ) ∝ I(θ)^(1/2),

where I(θ) is the Fisher information of θ. That this definition is invariant to parametrization can be seen as follows.

Noninformative priors (continued)

Let φ = h(θ) be a regular, monotonic transformation of θ, with inverse transformation θ = h^(-1)(φ). Then the Fisher information of φ is

I(φ) = E[ (d log p(y | φ)/dφ)^2 | φ ]
     = E[ (d log p(y | θ = h^(-1)(φ))/dθ)^2 | φ ] (dθ/dφ)^2
     = I(θ) (dθ/dφ)^2.

Thus,

I(φ)^(1/2) = I(θ)^(1/2) |dθ/dφ|.

On the other hand, by the change-of-variables rule, p(φ) = p(θ) |dθ/dφ| ∝ I(θ)^(1/2) |dθ/dφ|, as required.

Jeffreys' prior: examples

Binomial distribution. The Fisher information of the binomial parameter θ is I(θ) = n/[θ(1 - θ)]. Thus, the Jeffreys prior is p(θ) ∝ [θ(1 - θ)]^(-1/2), which is the Beta(1/2, 1/2) distribution.

The mean of the normal distribution. The Fisher information for the mean θ of the normal distribution is I(θ) = n/σ^2. This is independent of θ, so the Jeffreys prior is constant, p(θ) ∝ 1.

The variance of the normal distribution. Assume that the variance θ of the normal distribution N(µ, θ) is unknown. Then its Fisher information is I(θ) = n/(2θ^2), and the Jeffreys prior is p(θ) ∝ 1/θ.

Posterior intervals

We have seen that it is possible to summarize posterior information using point estimators. However, posterior regions and intervals are usually more useful. We say that a set C is a posterior region of level 1 - α for θ if the posterior probability of θ belonging to C is 1 - α:

Pr(θ ∈ C | y) = ∫_C p(θ | y) dθ = 1 - α.

In the case of scalar parameters one can use posterior intervals (credible intervals). An equi-tailed posterior interval is defined using quantiles of the posterior distribution. Thus, (θ_L, θ_U) is a 100(1 - α)% interval if

Pr(θ < θ_L | y) = Pr(θ > θ_U | y) = α/2.

An advantage of this type of interval is that it is invariant with respect to one-to-one parameter transformations. Further, it is easy to compute.
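For a posterior available in closed form, an equi-tailed interval is just a pair of quantiles. A sketch for the Beta(21, 81) posterior of Example 1 (which assumes a flat Beta(1, 1) prior):

# 95% equi-tailed posterior (credible) interval for a Beta(21, 81) posterior
qbeta(c(0.025, 0.975), 21, 81)   # roughly (0.13, 0.29)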

Posterior intervals (continued)

A posterior region is said to be a highest posterior density region (HPD region) if the posterior density is larger at all points of the region than at any point outside the region. This type of region has the smallest possible volume. In the scalar case, an HPD interval has the smallest length. On the other hand, the bounds of the interval are not invariant with respect to parameter transformations, and it is not always easy to determine them.

Example: Cardiac surgery data. Table 1 shows mortality rates for cardiac surgery on babies at 12 hospitals. If one wishes to estimate the mortality rate in hospital A, denoted θ_A, the simplest approach is to assume that the number of deaths y is binomially distributed with parameters n and θ_A, where n is the number of operations in A. Then the MLE is θ^_A = 0, which sounds too optimistic.
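For a unimodal posterior with a closed-form quantile function, an HPD interval can be found numerically by searching for the shortest interval with the required probability content. The helper below is a sketch (the function name hpd_beta and the use of the Beta(21, 81) posterior from Example 1 are my own choices, not from the lecture):

# Shortest (HPD) interval of a unimodal Beta posterior by minimizing its length
hpd_beta <- function(a, b, level = 0.95) {
  width <- function(p) qbeta(p + level, a, b) - qbeta(p, a, b)
  p <- optimize(width, c(0, 1 - level))$minimum   # optimal lower tail probability
  c(lower = qbeta(p, a, b), upper = qbeta(p + level, a, b))
}
hpd_beta(21, 81)   # slightly shorter than the equi-tailed interval above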

Posterior intervals (continued)

If we give a uniform prior to θ_A, then the posterior is Beta(1, 48), with posterior mean 1/49. The 95% HPD interval is (0, 6.05)% and the equi-tailed interval (0.05, 7.30)%. Figure 2 shows the posterior density. Another approach would use the total numbers of deaths and operations in all hospitals.

Table 1: Mortality rates y/n from cardiac surgery in 12 hospitals (Spiegelhalter et al., BUGS 0.5 Examples Volume 1, Cambridge: MRC Biostatistics Unit, 1996). The numbers of deaths y out of n operations.

A 0/47    B 18/148   C 8/119    D 46/810
E 8/211   F 13/196   G 9/148    H 31/215
I 14/207  J 8/97     K 29/256   L 24/360

Posterior intervals (continued)

Figure 2: Posterior density p(θ_A | y) of θ_A when the prior is uniform. The 95% HPD interval is indicated with vertical lines and the 95% equi-tailed interval with red colour.

Posterior intervals (continued)

The following BUGS and R code can be used to compute the equi-tailed and HPD intervals:

model{
  theta ~ dbeta(1,1)
  y ~ dbin(theta,n)
}

library(rjags)
hospital <- list(n=47,y=0)
hospital.jag <- jags.model("hospital.txt",hospital)
hospital.coda <- coda.samples(hospital.jag,"theta",10000)
summary(hospital.coda)
HPDinterval(hospital.coda)

# Compare with the exact upper limit of the HPD interval:
qbeta(0.95,1,48)   # approximately 0.0605, i.e. 6.05%

Posterior predictive distribution

If we wish to predict a new observation ỹ on the basis of the sample y = (y_1, ..., y_n), we may use its posterior predictive distribution. This is defined to be the conditional distribution of ỹ given y:

p(ỹ | y) = ∫ p(ỹ, θ | y) dθ = ∫ p(ỹ | y, θ) p(θ | y) dθ,

where p(ỹ | y, θ) is the density of the predictive distribution.

It is easy to simulate the posterior predictive distribution. First, draw simulations θ_1, ..., θ_L from the posterior p(θ | y); then, for each i, draw ỹ_i from the predictive distribution p(ỹ | y, θ_i).
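A sketch of this two-step simulation in plain R for the beta-binomial setting of the next example (the prior parameters and data are those used there; the code itself is an illustration, not part of the lecture material):

# Simulating the posterior predictive distribution of the next coin toss
# (n = 10 tosses, y = 4 heads, Jeffreys prior Beta(0.5, 0.5), as in the example below)
set.seed(1)
n <- 10; y <- 4; alpha <- 0.5; beta <- 0.5
L     <- 10000
theta <- rbeta(L, alpha + y, beta + n - y)   # step 1: draws from the posterior
ynew  <- rbinom(L, size = 1, prob = theta)   # step 2: draws from p(ynew | theta)
mean(ynew)                                   # ≈ Pr(ynew = 1 | y) = 4.5/11 ≈ 0.409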

Posterior predictive distribution: example

Assume that we have a coin with unknown probability θ of a head. If there occur y heads among the first n tosses, what is the probability of a head on the next toss?

Let ỹ = 1 (ỹ = 0) indicate the event that the next toss is a head (tail). If the prior of θ is Beta(α, β), then

p(ỹ | y) = ∫ p(ỹ | y, θ) p(θ | y) dθ
         = ∫_0^1 θ^ỹ (1 - θ)^(1-ỹ) θ^(α+y-1) (1 - θ)^(β+n-y-1) / B(α + y, β + n - y) dθ
         = B(α + y + ỹ, β + n - y - ỹ + 1) / B(α + y, β + n - y)
         = (α + y)^ỹ (β + n - y)^(1-ỹ) / (α + β + n).

Posterior predictive distribution: example (continued)

Thus, Pr(ỹ = 1 | y) = (α + y)/(α + β + n). This tends to the sample proportion y/n as n → ∞, so that the role of the prior information vanishes.

If n = 10 and y = 4 and the prior parameters are α = β = 0.5 (Jeffreys' prior), the posterior predictive distribution can be simulated with BUGS as follows:

model{
  theta ~ dbeta(alpha,beta)
  y ~ dbin(theta,n)
  ynew ~ dbern(theta)
}

library(rjags)
coin <- list(n=10,y=4,alpha=0.5,beta=0.5)
coin.jag <- jags.model("coin.txt",coin)
coin.coda <- coda.samples(coin.jag,c("theta","ynew"),10000)
summary(coin.coda)
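For comparison, the analytic predictive probability from the formula above can be checked directly with the same values:

# Analytic check of Pr(ynew = 1 | y) = (alpha + y) / (alpha + beta + n)
alpha <- 0.5; beta <- 0.5; n <- 10; y <- 4
(alpha + y) / (alpha + beta + n)   # = 4.5/11 ≈ 0.409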

Normal distribution
Poisson distribution
Exponential distribution

Normal distribution with known variance

Next we will consider some simple single-parameter models. Let us first assume that y = (y_1, ..., y_n) is a sample from a normal distribution with unknown mean θ and known variance σ^2. The likelihood is then

p(y | θ) = Π_{i=1}^n (2πσ^2)^(-1/2) exp{-(y_i - θ)^2/(2σ^2)}
         ∝ exp{-Σ_{i=1}^n (y_i - θ)^2/(2σ^2)}
         ∝ exp{-n(θ - ȳ)^2/(2σ^2)}.

By replacing σ^2/n with τ_0^2 and ȳ with µ_0, we find a conjugate prior

p(θ) ∝ exp{-(θ - µ_0)^2/(2τ_0^2)},

which is N(µ_0, τ_0^2).

Normal distribution with known variance (continued)

With this prior the posterior becomes

p(θ | y) ∝ p(θ) p(y | θ)
         ∝ exp{-(θ - µ_0)^2/(2τ_0^2)} exp{-n(θ - ȳ)^2/(2σ^2)}
         ∝ exp{ -(1/2) [ (1/τ_0^2 + n/σ^2) θ^2 - 2 (µ_0/τ_0^2 + n ȳ/σ^2) θ ] }
         ∝ exp{-(θ - µ_n)^2/(2τ_n^2)},

where

1/τ_n^2 = 1/τ_0^2 + n/σ^2   and   µ_n = (µ_0/τ_0^2 + n ȳ/σ^2) / (1/τ_0^2 + n/σ^2).

Thus the posterior is N(µ_n, τ_n^2): the posterior precision is the sum of the prior and data precisions, and the posterior mean is a precision-weighted average of µ_0 and ȳ.
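A sketch of this conjugate update in R, with hypothetical data and prior values, computing µ_n and τ_n^2 from the formulas above:

# Normal-normal conjugate update with known variance (hypothetical numbers)
sigma2 <- 4                     # known data variance
mu0    <- 0;  tau02 <- 100      # prior N(mu0, tau02), here a vague prior
set.seed(1)
y <- rnorm(25, mean = 2, sd = sqrt(sigma2))        # simulated data, n = 25
n <- length(y)
tau_n2 <- 1 / (1/tau02 + n/sigma2)                 # posterior variance
mu_n   <- tau_n2 * (mu0/tau02 + n*mean(y)/sigma2)  # posterior mean
c(mu_n = mu_n, tau_n2 = tau_n2)                    # posterior is N(mu_n, tau_n2)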


More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference University of Pennsylvania EABCN Training School May 10, 2016 Bayesian Inference Ingredients of Bayesian Analysis: Likelihood function p(y φ) Prior density p(φ) Marginal data density p(y ) = p(y φ)p(φ)dφ

More information

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018 Bayesian Methods David S. Rosenberg New York University March 20, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 March 20, 2018 1 / 38 Contents 1 Classical Statistics 2 Bayesian

More information

Data Analysis and Uncertainty Part 2: Estimation

Data Analysis and Uncertainty Part 2: Estimation Data Analysis and Uncertainty Part 2: Estimation Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu 1 Topics in Estimation 1. Estimation 2. Desirable

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Bayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017

Bayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017 Bayesian inference Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 10, 2017 1 / 22 Outline for today A genetic example Bayes theorem Examples Priors Posterior summaries

More information

Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling. ICPSR Day 4 Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Bayesian Inference for Normal Mean

Bayesian Inference for Normal Mean Al Nosedal. University of Toronto. November 18, 2015 Likelihood of Single Observation The conditional observation distribution of y µ is Normal with mean µ and variance σ 2, which is known. Its density

More information

Bayesian Inference: Posterior Intervals

Bayesian Inference: Posterior Intervals Bayesian Inference: Posterior Intervals Simple values like the posterior mean E[θ X] and posterior variance var[θ X] can be useful in learning about θ. Quantiles of π(θ X) (especially the posterior median)

More information

A Discussion of the Bayesian Approach

A Discussion of the Bayesian Approach A Discussion of the Bayesian Approach Reference: Chapter 10 of Theoretical Statistics, Cox and Hinkley, 1974 and Sujit Ghosh s lecture notes David Madigan Statistics The subject of statistics concerns

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

Bayesian Inference for Regression Parameters

Bayesian Inference for Regression Parameters Bayesian Inference for Regression Parameters 1 Bayesian inference for simple linear regression parameters follows the usual pattern for all Bayesian analyses: 1. Form a prior distribution over all unknown

More information

Bayesian Inference. Chapter 2: Conjugate models

Bayesian Inference. Chapter 2: Conjugate models Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in

More information

A primer on Bayesian statistics, with an application to mortality rate estimation

A primer on Bayesian statistics, with an application to mortality rate estimation A primer on Bayesian statistics, with an application to mortality rate estimation Peter off University of Washington Outline Subjective probability Practical aspects Application to mortality rate estimation

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics Introduction to Bayesian Statistics Dimitris Fouskakis Dept. of Applied Mathematics National Technical University of Athens Greece fouskakis@math.ntua.gr M.Sc. Applied Mathematics, NTUA, 2014 p.1/104 Thomas

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Bayesian Analysis (Optional)

Bayesian Analysis (Optional) Bayesian Analysis (Optional) 1 2 Big Picture There are two ways to conduct statistical inference 1. Classical method (frequentist), which postulates (a) Probability refers to limiting relative frequencies

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

MTH709U/MTHM042 Bayesian Statistics. Lawrence Pettit Queen Mary, Spring 2012

MTH709U/MTHM042 Bayesian Statistics. Lawrence Pettit Queen Mary, Spring 2012 1 MTH709U/MTHM042 Bayesian Statistics Lawrence Pettit Queen Mary, Spring 2012 2 Contents 1 Introduction 7 1.1 Bayes theorem........................... 7 1.2 The Likelihood Principle......................

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

1 Introduction. P (n = 1 red ball drawn) =

1 Introduction. P (n = 1 red ball drawn) = Introduction Exercises and outline solutions. Y has a pack of 4 cards (Ace and Queen of clubs, Ace and Queen of Hearts) from which he deals a random of selection 2 to player X. What is the probability

More information

Bayesian SAE using Complex Survey Data Lecture 1: Bayesian Statistics

Bayesian SAE using Complex Survey Data Lecture 1: Bayesian Statistics Bayesian SAE using Complex Survey Data Lecture 1: Bayesian Statistics Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 101 Outline Motivation Bayesian Learning Probability

More information

Likelihood and Bayesian Inference for Proportions

Likelihood and Bayesian Inference for Proportions Likelihood and Bayesian Inference for Proportions September 18, 2007 Readings Chapter 5 HH Likelihood and Bayesian Inferencefor Proportions p. 1/24 Giardia In a New Zealand research program on human health

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes

More information

Noninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions

Noninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions Communications for Statistical Applications and Methods 03, Vol. 0, No. 5, 387 394 DOI: http://dx.doi.org/0.535/csam.03.0.5.387 Noninformative Priors for the Ratio of the Scale Parameters in the Inverted

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian Inference. STA 121: Regression Analysis Artin Armagan

Bayesian Inference. STA 121: Regression Analysis Artin Armagan Bayesian Inference STA 121: Regression Analysis Artin Armagan Bayes Rule...s! Reverend Thomas Bayes Posterior Prior p(θ y) = p(y θ)p(θ)/p(y) Likelihood - Sampling Distribution Normalizing Constant: p(y

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Integrated Objective Bayesian Estimation and Hypothesis Testing

Integrated Objective Bayesian Estimation and Hypothesis Testing Integrated Objective Bayesian Estimation and Hypothesis Testing José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es 9th Valencia International Meeting on Bayesian Statistics Benidorm

More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

Linear Models A linear model is defined by the expression

Linear Models A linear model is defined by the expression Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

EXERCISES FOR SECTION 1 AND 2

EXERCISES FOR SECTION 1 AND 2 EXERCISES FOR SECTION AND Exercise. (Conditional probability). Suppose that if θ, then y has a normal distribution with mean and standard deviation σ, and if θ, then y has a normal distribution with mean

More information

Likelihood and Bayesian Inference for Proportions

Likelihood and Bayesian Inference for Proportions Likelihood and Bayesian Inference for Proportions September 9, 2009 Readings Hoff Chapter 3 Likelihood and Bayesian Inferencefor Proportions p.1/21 Giardia In a New Zealand research program on human health

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Lecture 13 Fundamentals of Bayesian Inference

Lecture 13 Fundamentals of Bayesian Inference Lecture 13 Fundamentals of Bayesian Inference Dennis Sun Stats 253 August 11, 2014 Outline of Lecture 1 Bayesian Models 2 Modeling Correlations Using Bayes 3 The Universal Algorithm 4 BUGS 5 Wrapping Up

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Chapter 4. Bayesian inference. 4.1 Estimation. Point estimates. Interval estimates

Chapter 4. Bayesian inference. 4.1 Estimation. Point estimates. Interval estimates Chapter 4 Bayesian inference The posterior distribution π(θ x) summarises all our information about θ to date. However, sometimes it is helpful to reduce this distribution to a few key summary measures.

More information

Bayesian inference: an introduction

Bayesian inference: an introduction Bayesian inference: an introduction Peter Green School of Mathematics University of Bristol 8/9 September 2011 / MLSS 2011, Bordeaux Green (Bristol) Bayesian inference MLSS, September 2011 1 / 74 Outline

More information

9 Bayesian inference. 9.1 Subjective probability

9 Bayesian inference. 9.1 Subjective probability 9 Bayesian inference 1702-1761 9.1 Subjective probability This is probability regarded as degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake pm to win

More information

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs

Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Sequential Monitoring of Clinical Trials Session 4 - Bayesian Evaluation of Group Sequential Designs Presented August 8-10, 2012 Daniel L. Gillen Department of Statistics University of California, Irvine

More information

Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Module 22: Bayesian Methods Lecture 9 A: Default prior selection Module 22: Bayesian Methods Lecture 9 A: Default prior selection Peter Hoff Departments of Statistics and Biostatistics University of Washington Outline Jeffreys prior Unit information priors Empirical

More information

Using Probability to do Statistics.

Using Probability to do Statistics. Al Nosedal. University of Toronto. November 5, 2015 Milk and honey and hemoglobin Animal experiments suggested that honey in a diet might raise hemoglobin level. A researcher designed a study involving

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

2018 SISG Module 20: Bayesian Statistics for Genetics Lecture 2: Review of Probability and Bayes Theorem

2018 SISG Module 20: Bayesian Statistics for Genetics Lecture 2: Review of Probability and Bayes Theorem 2018 SISG Module 20: Bayesian Statistics for Genetics Lecture 2: Review of Probability and Bayes Theorem Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction

More information

Estimation of Quantiles

Estimation of Quantiles 9 Estimation of Quantiles The notion of quantiles was introduced in Section 3.2: recall that a quantile x α for an r.v. X is a constant such that P(X x α )=1 α. (9.1) In this chapter we examine quantiles

More information

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December

More information

Classical and Bayesian inference

Classical and Bayesian inference Classical and Bayesian inference AMS 132 January 18, 2018 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 18, 2018 1 / 9 Sampling from a Bernoulli Distribution Theorem (Beta-Bernoulli

More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference Congreso Latinoamericano de Estadística Bayesiana July 1-4, 2015 Probability vs. Statistics What this course is Probability: Statistics: Unknown p(x θ) Known What is the number of heads in 6 tosses of

More information

The comparative studies on reliability for Rayleigh models

The comparative studies on reliability for Rayleigh models Journal of the Korean Data & Information Science Society 018, 9, 533 545 http://dx.doi.org/10.7465/jkdi.018.9..533 한국데이터정보과학회지 The comparative studies on reliability for Rayleigh models Ji Eun Oh 1 Joong

More information

Computational Cognitive Science

Computational Cognitive Science Computational Cognitive Science Lecture 9: Bayesian Estimation Chris Lucas (Slides adapted from Frank Keller s) School of Informatics University of Edinburgh clucas2@inf.ed.ac.uk 17 October, 2017 1 / 28

More information