APM 541: Stochastic Modelling in Biology
Bayesian Inference

Jay Taylor (ASU), Fall 2013

Outline

1. Introduction
2. Conjugate Distributions
3. Noninformative Priors
4. MCMC

Introduction

Bayesian Statistics: Overview

Bayesian statistics is a collection of methods that rely on two basic principles:

1. Probability is interpreted as a measure of uncertainty, whatever the source. Thus, in a Bayesian analysis, it is standard practice to assign probability distributions not only to unseen data, but also to parameters, models, and hypotheses.

2. Bayes' formula is used to update the probability distributions representing uncertainty as new data are acquired. Recall that if A and B are two events with positive probabilities P(A) > 0 and P(B) > 0, then Bayes' formula asserts that

$$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(A)}.$$

Suppose that our goal is to use some newly acquired data to estimate the value of an unknown parameter Θ. A Bayesian treatment of this problem would consist of the following steps:

1. We first need to formulate a statistical model which specifies the conditional distribution of the data under each possible set of parameter values. This is specified by the likelihood function, $p(D \mid \Theta = \theta)$.

2. We then choose a prior distribution, $p(\theta)$, for the unknown parameter, which quantifies how strongly we believe that Θ = θ before we examine the new data.

3. We then examine the new data D.

4. In light of this new data, we use Bayes' formula to revise our beliefs concerning the value of the parameter Θ:

$$p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{p(D)}.$$

The quantity $p(\theta \mid D)$ is called the posterior distribution of the parameter θ given the data; $p(D)$ is sometimes called the marginal likelihood or the prior predictive probability.
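To make the four steps concrete, here is a minimal numerical sketch (mine, not from the slides) that carries them out on a grid; the likelihood, prior, and data are placeholder choices for illustration.

```python
import numpy as np
from scipy.stats import binom

# Step 1: likelihood p(D | theta) -- here, k successes in n Bernoulli trials.
# Step 2: prior p(theta) -- here, a discretized uniform prior on a grid.
theta = np.linspace(0.001, 0.999, 999)    # grid of parameter values
prior = np.ones_like(theta) / theta.size  # uniform prior over the grid

# Step 3: observe the data D.
n_obs, k_obs = 10, 7
likelihood = binom.pmf(k_obs, n_obs, theta)

# Step 4: Bayes' formula; p(D) is the sum (grid approximation of the integral).
marginal = np.sum(prior * likelihood)
posterior = prior * likelihood / marginal

print("posterior mean:", np.sum(theta * posterior))
```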

Example: Sex ratios (at birth) vary widely in the animal kingdom, and some species of invertebrates have extremely skewed sex ratios, with predominantly female broods. Suppose that we are studying a newly described species that we suspect of having a female-biased sex ratio. For the sake of the example, we will make the following simplifying assumptions:

- The probability of being born female does not vary within or between broods. In particular, there is no environmental or heritable variation in sex ratio.
- The sexes of the different members of a brood are determined independently of one another.

We will let Θ denote the unknown probability that an individual is born female. Conditional on Θ = θ, the sex ratio at birth will be $\theta/(1-\theta)$. To estimate the birth sex ratio we observe b broods and determine that the i-th brood contains $f_i$ females and $m_i$ males. Let n be the total number of offspring in all b broods and let f and m be the total numbers of females and males among them.

We need to specify both the likelihood function and the prior distribution.

Under our assumptions about sex determination, the likelihood function depends only on the sex ratio and the total numbers of males and females in the broods. Conditional on Θ = θ, the total number of female offspring among the n offspring is binomially distributed with parameters n and θ:

$$P(f \mid \Theta = \theta) = \binom{n}{f}\,\theta^f (1-\theta)^{n-f}.$$

We need to select a prior distribution on Θ, and this should reflect what we have previously learned about the birth sex ratio in this species. This could be determined by prior observations of this species or by information about the sex ratio in closely related species. Also, since the value of Θ is a probability, our prior distribution should be supported on the interval [0, 1]. Here I will let the prior distribution be the Beta distribution with parameters 2 and 1:

$$p(\theta) = 2\theta.$$

This places greater weight on more female-skewed sex ratios (θ > 1/2) without ruling out the possibility of an even sex ratio.

We are now ready to use Bayes' formula to calculate the posterior distribution of Θ conditional on having observed f female offspring out of a total of n offspring:

$$p(\Theta = \theta \mid f, m) = \frac{2\theta \cdot \binom{n}{f}\theta^f(1-\theta)^m}{P(f, m)} = \frac{1}{C}\,\theta^{f+1}(1-\theta)^m,$$

where C is a normalizing constant that does not depend on θ. This we recognize as another Beta distribution, now with parameters f + 2 and m + 1, i.e.,

$$p(\Theta = \theta \mid f, m) = \frac{1}{\beta(f+2,\, m+1)}\,\theta^{f+1}(1-\theta)^m.$$

Remark: Notice that we did not have to evaluate the marginal likelihood P(f, m) in this example because it was absorbed into the normalizing constant.
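The conjugate update Beta(a, b) → Beta(a + f, b + m) is easy to check numerically; a minimal sketch assuming SciPy is available (the helper name is mine, not from the slides):

```python
from scipy.stats import beta

def beta_binomial_posterior(a, b, f, m):
    """Posterior for a Beta(a, b) prior after observing f females and m males."""
    return beta(a + f, b + m)

# The slide's example: Beta(2, 1) prior, data f = 7 females, m = 3 males.
post = beta_binomial_posterior(2, 1, 7, 3)  # Beta(9, 4)
print(post.mean(), post.interval(0.95))     # posterior mean and central 95% interval
```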

It is often good practice to investigate the sensitivity of the posterior distribution to different choices of the prior distribution.

For example, if we knew nothing about the birth sex ratios in the taxon containing our new species, then we might take the prior distribution to be uniform on [0, 1]. In this case, the posterior distribution of Θ would be proportional to the likelihood, giving

$$p(\Theta = \theta \mid f, m) = \frac{\binom{n}{f}\theta^f(1-\theta)^m}{P(f, m)} = \frac{\theta^f(1-\theta)^m}{\beta(f+1,\, m+1)},$$

i.e., the posterior distribution is the Beta distribution with parameters f + 1 and m + 1.

Alternatively, if we have grounds to believe that the sex ratio is even, then we might take the prior distribution to be the Beta distribution with parameters 5 and 5, in which case the posterior distribution is still a Beta distribution, but with parameters f + 5 and m + 5:

$$p(\Theta = \theta \mid f, m) = \frac{\theta^4(1-\theta)^4}{\beta(5, 5)} \cdot \frac{\binom{n}{f}\theta^f(1-\theta)^m}{P(f, m)} = \frac{\theta^{f+4}(1-\theta)^{m+4}}{\beta(f+5,\, m+5)}.$$

The figures show the posterior distributions corresponding to each of these three priors for two different data sets: either f = 7, m = 3 (left plot) or f = 70, m = 30 (right plot).

[Figure: two panels plotting posterior density against θ on [0, 1] for the Beta(2,1), Beta(1,1), and Beta(5,5) priors; left panel "Posterior Distribution: f = 7, m = 3", right panel "Posterior Distribution: f = 70, m = 30".]
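For readers who want to reproduce the figure, a hedged reconstruction using matplotlib (the styling choices are mine, not the original plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0, 1, 500)
priors = [(2, 1), (1, 1), (5, 5)]   # the three Beta priors
datasets = [(7, 3), (70, 30)]       # (f, m) for the two panels

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (f, m) in zip(axes, datasets):
    for a, b in priors:
        # Conjugate update: Beta(a, b) prior -> Beta(a + f, b + m) posterior.
        ax.plot(theta, beta.pdf(theta, a + f, b + m), label=f"Beta({a},{b})")
    ax.set_title(f"Posterior Distribution: f = {f}, m = {m}")
    ax.set_xlabel("theta")
    ax.set_ylabel("posterior density")
    ax.legend()
plt.show()
```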

Remarks:

- The posterior distribution depends on both the prior distribution and the data, but usually to different degrees. If there is little data, or if the data is not very informative about the unknown parameters, then the posterior distribution will be dominated by the prior distribution. Even if the data is both abundant and very informative, this will be reflected by the posterior distribution only if the prior distribution is sufficiently diffuse.

- Cromwell's rule: The prior distribution should only assign a probability of 0 to events that are logically false.

- The information content of the data can be evaluated by determining how different the posterior distribution is from the prior distribution. This is often done using the Kullback-Leibler divergence.

Summaries of the Posterior Distribution

While the posterior distribution comprehensively represents the state of our knowledge concerning Θ once we have acquired the data, it is sometimes convenient to have simple numerical summaries of this distribution.

- Point estimators of Θ include the mean and the median of the posterior distribution.
- The maximum a posteriori (MAP) estimate of Θ is the value $\theta_{MAP}$ that maximizes the posterior distribution. This is comparable to the maximum likelihood estimate, but is less robust to approximation error.
- The spread of the posterior distribution is often summarized by the standard deviation or by the $\alpha/2$ and $1 - \alpha/2$ quantiles (e.g., with α = 0.05).
- The posterior distribution can also be summarized by specifying the $100(1-\alpha)\%$ highest probability density (HPD) region.

The table lists various summary statistics for the posterior distributions calculated in the sex ratio example.

data (f, m)   prior      mean    median   sd      2.5%    97.5%
7, 3          β(2, 1)    0.692   0.702    0.123   0.428   0.901
              β(1, 1)    0.667   0.676    0.131   0.390   0.891
              β(5, 5)    0.600   0.603    0.107   0.384   0.798
70, 30        β(2, 1)    0.699   0.700    0.045   0.608   0.783
              β(1, 1)    0.697   0.697    0.045   0.604   0.781
              β(5, 5)    0.682   0.683    0.044   0.592   0.765

Interpretation: Whereas the small data set does not allow us to reject the hypothesis that the sex ratio is 1:1, all three analyses of the large data set suggest that the probability of being born female is greater than 0.59.
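The table entries can be recomputed directly from the corresponding Beta posteriors; a minimal sketch assuming SciPy:

```python
from scipy.stats import beta

for f, m in [(7, 3), (70, 30)]:
    for a, b in [(2, 1), (1, 1), (5, 5)]:
        post = beta(a + f, b + m)  # conjugate posterior
        print(f"data {f},{m}  prior Beta({a},{b}): "
              f"mean={post.mean():.3f} median={post.median():.3f} "
              f"sd={post.std():.3f} "
              f"2.5%={post.ppf(0.025):.3f} 97.5%={post.ppf(0.975):.3f}")
```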

Acquiring New Data

One of the strengths of Bayesian statistics is that it can easily handle sequentially acquired data.

Example: Suppose that in our first experiment we collected five broods totaling 7 females and 3 males. As we can see on the previous slide, all three posterior distributions are fairly broad and only weakly suggest that the sex ratio is female-biased. Suppose that we then conduct a second experiment, independent of the first, during which we observe 15 females and 5 males. We can again use Bayes' formula to account for this new data, but now we use the former posterior distribution as our new prior distribution:

$$p(\Theta = \theta \mid f = 15, m = 5) = \frac{\theta^{8}(1-\theta)^{3}}{\beta(9, 4)} \cdot \frac{\binom{20}{15}\theta^{15}(1-\theta)^{5}}{P(f = 15, m = 5)} = \frac{\theta^{23}(1-\theta)^{8}}{\beta(24, 9)}.$$

If we were to then conduct a third experiment, we would use the posterior distribution obtained from the second as the new prior, and so forth.
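Sequential updating is just the conjugate update applied repeatedly; a minimal sketch of the two experiments described above:

```python
a, b = 2, 1                      # Beta(2, 1) prior
for f, m in [(7, 3), (15, 5)]:   # experiment 1, then experiment 2
    a, b = a + f, b + m          # yesterday's posterior is today's prior
print(a, b)                      # (24, 9): the Beta(24, 9) posterior from the slide
```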

Criticisms

A number of objections have been raised against Bayesian methodology.

- Some critics object to the use of probabilities to quantify uncertainty in belief or plausibility. In particular, in the frequentist interpretation of probability, probabilities are only defined when they can be calculated as limiting frequencies in a sequence of independent trials. This stance largely rules out assigning probabilities to unknown parameters or models.

- Another criticism is that Bayesian methods are not objective because inference depends on a prior distribution, which can be chosen subjectively. A Bayesian rejoinder would be that the likelihood function, which is also used in frequentist analyses, is no less subjective than the prior.

- A third criticism, which is less relevant nowadays, is that Bayesian methods are computationally demanding. In the past this was often an insurmountable obstacle to conducting a Bayesian analysis.

Conjugate Distributions

Conjugate Prior Distributions

In the example we saw that both the prior distribution and the posterior distribution were beta distributions that differed only in the values of their parameters. Although the prior and posterior distribution will only belong to the same functional family under special circumstances, this situation is important enough in practice to have been studied in depth.

Definition: Suppose that F is a set of likelihood functions and that P is a set of prior distributions for θ. Then P is said to be conjugate for F if, whenever we choose a likelihood function $p(x \mid \theta) \in F$ and a prior distribution $p(\theta) \in P$, the posterior distribution

$$p(\theta \mid x) = \frac{p(\theta)\,p(x \mid \theta)}{p(x)}$$

also belongs to P.

Since we can always render a prior distribution conjugate for any set of likelihood functions simply by taking P sufficiently large, the previous definition is almost too general to be useful. However, there is an important special case where conjugacy becomes non-trivial.

Definition: A parametric family of distributions F on a set $E \subset \mathbb{R}^n$ is said to be an exponential family if the density of every member of the family has the following form:

$$p(x \mid \theta) = h(x)\exp\left(\sum_{i=1}^{m} \eta_i(\theta)\,t_i(x) - \psi(\theta)\right), \quad x \in E,$$

where

- $\theta = (\theta_1, \dots, \theta_k) \in \Omega \subset \mathbb{R}^k$ is the vector-valued parameter;
- $\psi: \Omega \to \mathbb{R}$ is the log-partition function;
- $h: E \to [0, \infty]$;
- $\eta = (\eta_1, \dots, \eta_m): \Omega \to \mathbb{R}^m$ is the vector-valued natural parameter;
- $T = (t_1, \dots, t_m): E \to \mathbb{R}^m$ is a vector of the same dimension as η.

T(x) is said to be a sufficient statistic for the distribution.

Many familiar parametric distributions belong to exponential families.

Example: The exponential distributions are an exponential family, since we can write the density as

$$p(x \mid \lambda) = \lambda e^{-\lambda x} = e^{-\lambda x + \ln(\lambda)},$$

where

- $E = \Omega = [0, \infty)$; θ = λ;
- $h(x) = 1$ and $\psi(\lambda) = -\ln(\lambda)$;
- $\eta_1(\lambda) = -\lambda$ and $t_1(x) = x$.

Example: The normal distributions are an exponential family, since we can write the density as

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2} = \exp\left(\frac{\mu}{\sigma^2}\,x - \frac{1}{2\sigma^2}\,x^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right).$$

In this case,

- $E = \mathbb{R}$;
- $\Omega = \mathbb{R} \times (0, \infty)$ with $\theta_1 = \mu$ and $\theta_2 = \sigma$;
- $h(x) = 1$;
- $\psi(\mu, \sigma) = \mu^2/2\sigma^2 + \ln(2\pi\sigma^2)/2$;
- $\eta_1(\mu, \sigma) = \mu/\sigma^2$ and $\eta_2(\mu, \sigma) = -1/(2\sigma^2)$;
- $t_1(x) = x$ and $t_2(x) = x^2$.

Example: For each fixed $n \ge 1$, the binomial distributions are an exponential family:

$$p_n(x \mid p) = \binom{n}{x}\exp\left(\ln\left(\frac{p}{1-p}\right)x + n\ln(1-p)\right),$$

where

- $E = \{0, 1, \dots, n\} \subset \mathbb{R}$;
- $\Omega = [0, 1]$ with θ = p;
- $h(x) = \binom{n}{x}$ and $\psi(p) = -n\ln(1-p)$;
- $\eta_1(p) = \ln(p/(1-p))$ and $t_1(x) = x$.

Remark: The binomial distributions do not form an exponential family if n and p are both allowed to vary. In effect, this is because there is no way to factor $\binom{n}{x}$ into a product of a function of n and a function of x.

Every Exponential Family Has a Parametric Conjugate Prior Distribution

Our interest in exponential families is motivated by the following observation. Suppose that the likelihood function belongs to an exponential family F with densities

$$p(x \mid \theta) = h(x)\exp\left(\sum_{i=1}^{m}\eta_i(\theta)\,t_i(x) - \psi(\theta)\right),$$

and let P be the set of prior distributions on θ with densities of the form

$$p(\theta;\, a_1, \dots, a_m, d) = \frac{1}{C}\exp\left(\sum_{i=1}^{m} a_i\,\eta_i(\theta) - d\,\psi(\theta)\right),$$

where $a_1, \dots, a_m$ and d are constants and $C = C(a_1, \dots, a_m, d)$ is a normalizing constant. Notice that P is itself a parametric family with an (m + 1)-dimensional parameter.

Then the posterior distribution is

$$p(\theta \mid x) = \frac{1}{C}\,\frac{h(x)}{p(x)}\exp\left(\sum_{i=1}^{m}\eta_i(\theta)\,(a_i + t_i(x)) - (d+1)\,\psi(\theta)\right) = \frac{1}{C'}\exp\left(\sum_{i=1}^{m}\eta_i(\theta)\,(a_i + t_i(x)) - (d+1)\,\psi(\theta)\right) = p(\theta;\, a_1 + t_1(x), \dots, a_m + t_m(x),\, d + 1).$$

This shows that the posterior distribution is also a member of P and therefore that P is conjugate for F. In this case, the posterior distribution can be calculated simply by updating the parameters of the prior distribution:

$$a_i \to a_i + t_i(x), \quad i = 1, \dots, m, \qquad d \to d + 1.$$

Because this updating usually requires much less computation than direct evaluation of the marginal likelihood of the data, p(x), conjugate priors are often used in problems where computational efficiency is at a premium.
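In code, this update is a one-line bookkeeping step; the sketch below is a generic illustration (the function name and example numbers are mine, not from the slides):

```python
def conjugate_update(hyper, t_x):
    """One-observation update of conjugate hyperparameters (a_1..a_m, d).

    hyper: tuple (a, d) with a a list of length m; t_x: sufficient statistics t(x).
    """
    a, d = hyper
    return ([ai + ti for ai, ti in zip(a, t_x)], d + 1)

# Exponential likelihood (m = 1, t(x) = x); the prior hyperparameters (a, d)
# correspond to a Gamma prior on lambda (see the next slide).
hyper = ([2.0], 3)
hyper = conjugate_update(hyper, [1.7])  # observe x = 1.7
print(hyper)                            # ([3.7], 4)
```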

Example: A conjugate prior for the exponential distribution, $p(x \mid \lambda) = \lambda e^{-\lambda x}$, is

$$p(\lambda;\, a, d) = \frac{1}{C}\,e^{-a\lambda + d\ln(\lambda)} = \frac{a^{d+1}}{\Gamma(d+1)}\,\lambda^{d}\,e^{-a\lambda},$$

which we recognize as the Gamma distribution with rate parameter a and shape parameter d + 1. If this is used as the prior distribution, then the posterior distribution given data x is

$$p(\lambda \mid x) = \frac{(a+x)^{d+2}}{\Gamma(d+2)}\,\lambda^{d+1}\,e^{-(a+x)\lambda},$$

which is a Gamma distribution with rate parameter a + x and shape parameter d + 2.

Example: A conjugate prior for the binomial distribution, $p_n(x \mid p) = \binom{n}{x}p^x(1-p)^{n-x}$, with fixed n is

$$p_n(p;\, a, d) = \frac{1}{C}\exp\left((a-1)\ln\left(\frac{p}{1-p}\right) + dn\ln(1-p)\right) = \frac{1}{C}\,p^{a-1}(1-p)^{nd-a+1},$$

where C = C(a, d, n) is a normalizing constant. If we take $d = (a + b - 2)/n$, then this is just a Beta distribution with parameters a and b, and then the posterior distribution is

$$p_n(p \mid x) = \frac{1}{\beta(a+x,\, b+n-x)}\,p^{a+x-1}(1-p)^{b+n-x-1},$$

which is a Beta distribution with parameters a + x and b + n - x.

Example: A conjugate prior for the normal distribution,

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2},$$

is

$$p(\mu, \sigma^2) = \frac{b^a}{\Gamma(a)}\,(\sigma^2)^{-a-1}\exp\left(-\frac{b}{\sigma^2}\right) \cdot \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\mu - m)^2}{2\sigma^2}\right).$$

This is a special case of the so-called normal-inverse-gamma distribution, denoted $N\Gamma^{-1}$, with parameters m, 1, a and b. If this is used as the prior distribution, then the posterior distribution given a single observation x also has a normal-inverse-gamma distribution, but now with parameters (m + x)/2, 2, a + 1/2, and $b + (x - m)^2/4$.

Although the normal-inverse-gamma distribution appears to be complicated, it has a fairly straightforward interpretation. Suppose that the random vector (µ, σ²) is sampled in the following way:

$$1/\sigma^2 \sim \text{Gamma}(a, b), \qquad \mu \mid \sigma^2 \sim N(m,\, \sigma^2/\lambda).$$

Then $(\mu, \sigma^2) \sim N\Gamma^{-1}(m, \lambda, a, b)$. In other words, we first sample the precision $\tau^2 = 1/\sigma^2$ from a Gamma distribution with parameters a and b. Then, having sampled σ², we next sample µ from a normal distribution with mean m and variance σ²/λ.
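This two-step recipe translates directly into a sampler; a sketch assuming NumPy, with Gamma(a, b) parameterized by shape a and rate b as on this slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_normal_inverse_gamma(m, lam, a, b, size=1):
    """Draw (mu, sigma^2) ~ N-Gamma^{-1}(m, lam, a, b) via the two-step recipe."""
    tau2 = rng.gamma(shape=a, scale=1.0 / b, size=size)  # precision 1/sigma^2
    sigma2 = 1.0 / tau2
    mu = rng.normal(loc=m, scale=np.sqrt(sigma2 / lam))  # mu given sigma^2
    return mu, sigma2

mu, sigma2 = sample_normal_inverse_gamma(m=0.0, lam=1.0, a=3.0, b=2.0, size=5)
print(mu, sigma2)
```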

Noninformative Priors

When there is little prior information concerning the value of an unknown parameter, some practitioners recommend using a so-called noninformative or reference prior distribution. Ostensibly, these distributions are chosen so that the posterior distribution only depends on the data, although whether this is ever really the case is debated. Several methods for selecting uninformative priors have been proposed:

- Principle of indifference;
- Maximum entropy priors;
- Jeffreys priors;
- Reference priors.

Principle of Indifference

Suppose that we wish to assign a prior distribution p to a finite set $E = \{e_1, \dots, e_n\}$ and that we have no reason to believe that any one of the outcomes is more likely than any of the others. In this case, the principle of indifference states that the uniform distribution on E should be chosen as the prior distribution, i.e.,

$$p(e_i) = \frac{1}{n}, \quad i = 1, \dots, n.$$

Example: If we are asked to predict the outcome of a coin toss and we know nothing about the coin, then according to the principle of indifference our prior distribution should assign probability 1/2 to heads and probability 1/2 to tails.

Maximum Entropy Priors

There are several different concepts of entropy, but each of these is meant to quantify the amount of uncertainty in a distribution.

Definition: Suppose that P is a probability distribution on a set E.

1. If $E = \{e_1, e_2, \dots\}$ is discrete and P has probability mass function p(x), then the Shannon entropy of P is the quantity

$$H(P) = -\sum_i p(e_i)\log_b(p(e_i)),$$

where the base b is usually taken to be 2, e or 10. Notice that changing b simply changes the units in which the entropy is expressed.

2. If $E \subset \mathbb{R}^n$ is an open set and P is a continuous distribution on E with density p(x), then the differential entropy of P is the quantity

$$H(P) = -\int_E p(x)\log_b(p(x))\,dx.$$
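A quick numerical illustration of the discrete definition (my example, not from the slides); note that switching the base only rescales the result:

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy of a probability mass function p (zero entries are ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

p = [0.5, 0.25, 0.25]
print(shannon_entropy(p, base=2))     # 1.5 bits
print(shannon_entropy(p, base=np.e))  # ~1.04 nats: same entropy, different units
```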

One weakness of the previous definition (particularly in the case of the differential entropy) is that the entropies defined there depend on the parametrization of E: if $\psi: E \to E'$ is a one-to-one and onto function mapping E to E′, and P′ is the probability distribution induced on E′ by this transformation, then in general

$$H(P) \ne H(P').$$

This is problematic because in some sense P and P′ are the same distribution. All that we have done is to relabel the points of E using the transformation ψ. Accordingly, if the entropy is meant to be a measure of the uncertainty associated with a distribution, then P and P′ should have the same entropy.

In light of this issue, it is often useful to associate a reference measure m with E, which is sometimes called the invariant measure of E. For example, if E possesses a natural set of symmetries (e.g., translations of $\mathbb{R}$ or rotations of $S^1$), then m might be chosen to be a measure that is invariant under these symmetry transformations. In this case, we have the following more general definition of entropy due to Jaynes.

Definition: Suppose that E is equipped with an invariant measure m and that P is a probability distribution on E that has a density f with respect to m, i.e., for every subset $A \subset E$,

$$P(A) = \int_A f(x)\,m(dx).$$

Then the entropy of P (with respect to m) is the quantity

$$H_c(P) = -\int_E \log(f(x))\,P(dx).$$

It can be shown that $H_c(P)$ is not changed by a reparameterization of the sample space.

The Maximum Entropy Principle

Problem: Suppose that the set E is equipped with an invariant measure m and that our goal is to choose a prior distribution on E that is consistent with some body of prior information. Let $\mathcal{P}$ be the collection of probability distributions on E which have densities with respect to m and which are consistent with this information. How do we choose a prior distribution from this set?

Solution: According to the maximum entropy principle, we should select the distribution from $\mathcal{P}$ that has maximum entropy with respect to m, i.e., choose $P_{\text{maxent}}$ such that

$$H_c(P_{\text{maxent}}) \ge H_c(P) \quad \text{for all } P \in \mathcal{P}.$$

Rationale: We should choose a prior distribution that is consistent with our prior knowledge without introducing any hidden assumptions. According to the maximum entropy principle, the right way to do this is to choose the distribution which maximizes uncertainty subject to the prior constraints. This is a version of Occam's razor.

Maximum Entropy Distributions: Discrete Case

Suppose that $E = \{e_1, e_2, \dots\}$ is a discrete set and let m be a measure on E. We will assume that our prior knowledge is such that $\mathcal{P}$ is the set of all probability distributions on E such that

$$\sum_{e \in E} f_k(e)\,p(e) = \bar{f}_k, \quad k = 1, \dots, r,$$

where each $f_k: E \to \mathbb{R}$ is a real-valued function and each $\bar{f}_k$ is a constant. In this case, the maximum entropy distribution will have the form

$$p(e) = \frac{1}{Z}\,m(e)\exp\left(\sum_{k=1}^{r}\lambda_k f_k(e)\right),$$

where $\lambda_1, \dots, \lambda_r$ are Lagrange multipliers that must be chosen so that the r constraints are satisfied and Z is a normalizing constant. It can be shown that this distribution is a member of an exponential family and also a Gibbs measure on E.

Example: Suppose that:

- $E = \{e_1, \dots, e_n\}$ is finite;
- the invariant measure is given by a function $m: E \to [0, \infty)$ with positive total mass;
- we have no prior knowledge, and thus $\mathcal{P}$ is simply the collection of all probability distributions on E.

Because there are no constraints (besides normalization), there are no Lagrange multipliers to be solved for in this case, and so the maximum entropy distribution is simply

$$p(e_i) = m(e_i)\Big/\sum_{j=1}^{n} m(e_j).$$

In particular, if m is constant, then the maximum entropy distribution is just the uniform distribution on E, in which case the maximum entropy principle and the principle of indifference lead to the same choice of prior distribution.

Maximum Entropy Distributions: Continuous Case

Suppose that $E = (a, b) \subset \mathbb{R}$ is an interval and let m be a measure on E. We will assume that $\mathcal{P}$ is the set of probability distributions on E which have densities with respect to m and which satisfy the constraints

$$\int_E f_k(x)\,p(x)\,m(dx) = \bar{f}_k, \quad k = 1, \dots, r,$$

where each $f_k: E \to \mathbb{R}$ is a real-valued function and each $\bar{f}_k$ is a constant. In this case, the maximum entropy distribution will have a density (with respect to m) of the form

$$p(x) = \frac{1}{Z}\exp\left(\sum_{k=1}^{r}\lambda_k f_k(x)\right),$$

where the constants $\lambda_1, \dots, \lambda_r$ must be chosen so that the r constraints are satisfied and Z is a normalizing constant. As in the discrete case, this distribution is both a member of an exponential family and a Gibbs measure on E.

Examples:

- If E = [a, b] is compact, m(dx) = dx, and there are no prior constraints, then the maximum entropy distribution on E is the uniform distribution.

- If E = [0, ∞), m(dx) = dx, and $\mathcal{P}$ is the collection of continuous distributions on E with mean µ > 0, then the maximum entropy distribution in this set is the exponential distribution with rate parameter $\mu^{-1}$.

- If E = ℝ, m(dx) = dx, and $\mathcal{P}$ is the collection of continuous distributions on E with mean µ and variance σ² < ∞, then the maximum entropy distribution in this set is the normal distribution with this mean and variance.

- If E = ℝ, m(dx) = dx, and $\mathcal{P}$ is the collection of continuous distributions on E with mean µ, then there is no maximum entropy distribution. Indeed, since the entropy of the normal distribution with mean µ and variance σ² is $\ln(2\pi e\sigma^2)/2$, we can construct a sequence of distributions in $\mathcal{P}$ with entropy tending to ∞.

Jeffreys Priors

As the last example illustrates, not every class of probability distributions contains a maximum entropy distribution. This is especially likely if there are no prior constraints on that class. In lieu of the maximum entropy principle, we desire a rule that will choose an uninformative prior in a manner that is invariant under reparameterization. That is, if we have a procedure Π that chooses a distribution $\Pi_E$ on each space E, then for any bijection $\psi: E \to E'$ we want the following identity to hold:

$$\Pi_{\psi(E)} = \Pi_E \circ \psi^{-1},$$

i.e., we should arrive at the same distribution no matter whether we first choose the prior distribution and then reparameterize, or we first reparameterize and then select the prior distribution. It turns out that there is essentially a unique solution to this problem, which was described by Harold Jeffreys and leads to a distribution called the Jeffreys prior.

Before we can describe the Jeffreys prior, we need to introduce the Fisher information of a parametric family of distributions.

Definition: Suppose that p(x; θ) is a probability distribution on a set E that depends smoothly on a parameter $\theta \in \Omega \subset \mathbb{R}^d$. Then the Fisher information of p is the d × d matrix I(θ) with entries

$$(I(\theta))_{i,j} = E_\theta\left[\left(\frac{\partial}{\partial\theta_i}\log p(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j}\log p(X;\theta)\right)\right],$$

where X is a random variable with the distribution p(x; θ).

Interpretation: The Fisher information measures the expected curvature of the log-likelihood of the data with respect to the parameter θ. Its components measure how much information the data contains, on average, about the values of the components of the parameter.

Now suppose that F is the collection of likelihood functions of the form p(x; θ), where $\theta \in \Omega \subset \mathbb{R}^d$ and, as on the previous slide, we assume that p is a smooth function of the parameter. Then the Jeffreys prior distribution on θ is the distribution on Ω that is proportional to the square root of the determinant of the Fisher information:

$$p(\theta) \propto \sqrt{\det I(\theta)}.$$

It can be shown that this procedure is invariant under reparameterizations: if we transform θ to φ = φ(θ), then the Jeffreys prior for φ will be the same distribution as the transformation of the Jeffreys prior for θ.

Example: Suppose that the likelihood function is given by the binomial distribution with fixed n and variable θ:

$$p(x;\theta) = \binom{n}{x}\,\theta^x(1-\theta)^{n-x}.$$

In this case, the log-likelihood function is

$$\log(p(x;\theta)) = C + x\log(\theta) + (n - x)\log(1-\theta),$$

where C is a constant that does not depend on θ. Its derivative with respect to θ is

$$\frac{d}{d\theta}\log(p(x;\theta)) = \frac{x}{\theta} - \frac{n-x}{1-\theta},$$

and consequently the Fisher information is

$$I(\theta) = E\left[\left(\frac{X}{\theta} - \frac{n-X}{1-\theta}\right)^2\right] = \frac{n}{\theta(1-\theta)}.$$

It follows that for a binomial likelihood function, with fixed sample size n and unknown success probability θ, the Jeffreys prior on θ is

$$p(\theta) = \frac{1}{C}\sqrt{I(\theta)} = \frac{1}{\pi\sqrt{\theta(1-\theta)}},$$

which is just the Beta distribution with parameters 1/2 and 1/2. Notably, this is not the maximum entropy distribution on [0, 1] (which is just the standard uniform distribution), which demonstrates that these two principles lead to different choices of uninformative prior distributions even in relatively simple settings.
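The Fisher information computed above can be spot-checked by simulation; a minimal sketch (mine, not from the slides) assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 20, 0.3
x = rng.binomial(n, theta, size=200_000)

# Score: d/dtheta log p(x; theta) = x/theta - (n - x)/(1 - theta).
score = x / theta - (n - x) / (1 - theta)
print(np.mean(score**2))           # Monte Carlo estimate of I(theta)
print(n / (theta * (1 - theta)))   # exact value n / (theta (1 - theta))
```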

Examples:

- If the likelihood is normal with unknown mean µ and known variance σ², then the Fisher information is $I(\mu) = 1/\sigma^2$, while the Jeffreys prior for µ is $p(\mu) \propto 1/\sigma \propto 1$, which is just Lebesgue measure on ℝ. This is not normalizable and thus is said to be an improper prior. In this case, the use of this improper prior still leads to a proper posterior distribution, but one must be very cautious about the use of such prior distributions.

- If the likelihood is normal with known mean µ and unknown standard deviation σ, then the Fisher information is $I(\sigma) = 2/\sigma^2$, while the Jeffreys prior for σ is $p(\sigma) \propto 1/\sigma$ for σ > 0. Again, this is an improper prior distribution on [0, ∞).

MCMC

Markov Chain Monte Carlo Methods

Until recently, Bayesian methods were regarded as impractical for many problems because of the computational difficulties of evaluating the posterior distribution. In particular, to use Bayes' formula,

$$p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{p(D)},$$

we need to evaluate the marginal probability of the data,

$$p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta,$$

which requires integration of the likelihood function over the parameter space. Except in special cases, this integration must be performed numerically, and often this is over a space with dozens or hundreds of dimensions.

An alternative to direct evaluation of Bayes' formula is to use Monte Carlo methods to sample from the posterior distribution. Ideally, we would like to generate a collection of independent samples, $\Theta_1, \dots, \Theta_N$ with $\Theta_i \sim p(\theta \mid D)$, which can then be used to estimate expectations and probabilities under $p(\theta \mid D)$:

$$\int f(\theta)\,p(\theta \mid D)\,d\theta \approx \frac{1}{N}\sum_{i=1}^{N} f(\Theta_i), \qquad \int_A p(\theta \mid D)\,d\theta \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_A(\Theta_i).$$

In effect, we are using the empirical distribution of the sample to estimate $p(\theta \mid D)$:

$$p(\theta \mid D) \approx \frac{1}{N}\sum_{i=1}^{N} \delta_{\Theta_i}.$$
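With exact posterior samples, each of these approximations is one line of code; a sketch (assuming SciPy) using the Beta(9, 4) posterior from the sex-ratio example:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
samples = beta(9, 4).rvs(size=100_000, random_state=rng)  # Theta_i ~ p(theta | D)

print(np.mean(samples))        # estimates E[Theta | D] = 9/13
print(np.mean(samples > 0.5))  # estimates P(Theta > 0.5 | D)
```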

One way to sample from a given distribution is to construct a Markov chain that has this distribution as its stationary distribution. This idea leads to a set of sampling techniques which are collectively known as Markov chain Monte Carlo (MCMC) methods.

Suppose that we can formulate a discrete-time Markov chain $(\Theta_t : t \ge 0)$ which has $p(\theta \mid D)$ as a stationary distribution. If Θ is ergodic, then whatever the initial value of $\Theta_0$, the distribution of $\Theta_t$ will tend to $p(\theta \mid D)$. Furthermore, by sampling the chain at sufficiently widely spaced intervals, say $\Theta_T, \Theta_{2T}, \Theta_{3T}, \dots$, we will obtain a collection of samples that are approximately independent.

Formulations include the Metropolis-Hastings algorithm and the Gibbs sampler.

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm provides a recipe for constructing a Markov chain with a specified stationary distribution on a state space E. It requires the following ingredients:

- a target distribution with density π(x) on E;
- a stochastic kernel Q(x; y) on E × E: in other words, for each x ∈ E, Q(x; y) is the density of a probability distribution on E, called the proposal density;
- a sequence of i.i.d. standard uniform random variables $U_1, U_2, \dots$.

The aim is to construct a Markov chain $X = (X_t : t \ge 0)$ with stationary distribution π.

Remark: If we need to sample discrete random variables, then we can interpret each of the above probability densities as a probability mass function. (More broadly, we can assume that all of the densities are defined with respect to a common reference measure on E.)

The Metropolis-Hastings algorithm consists of repeated application of the following three steps. Suppose that $X_n = x$ is the current state of the Markov chain. Then the next state is chosen as follows:

Step 1: We first propose a new value for the chain by sampling $Y_{n+1} = y$ from the density Q(x; y).

Step 2: We then calculate the acceptance probability of the new state:

$$\alpha(x; y) = \min\left\{\frac{\pi(y)\,Q(y; x)}{\pi(x)\,Q(x; y)},\; 1\right\}.$$

Step 3: If $U_{n+1} \le \alpha(x; y)$, then we set $X_{n+1} = Y_{n+1}$. Otherwise, we set $X_{n+1} = X_n$.
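The three steps assemble into a short sampler. The sketch below (mine, not from the slides) targets the Beta(9, 4) posterior from the sex-ratio example using a symmetric random-walk proposal, for which the Q-ratio cancels from the acceptance probability; the step size is an arbitrary tuning choice.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(theta):
    """Unnormalized log of the Beta(9, 4) density; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return 8.0 * np.log(theta) + 3.0 * np.log(1.0 - theta)

def metropolis_hastings(log_pi, x0, n_steps, step=0.1):
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step * rng.normal()            # Step 1: propose (symmetric Q)
        log_alpha = log_pi(y) - log_pi(x)      # Step 2: log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:  # Step 3: accept or reject
            x = y
        chain[t] = x
    return chain

chain = metropolis_hastings(log_target, x0=0.5, n_steps=50_000)
print(chain.mean())  # should be close to 9/13 ~ 0.692
```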

Interpretation: Notice that the acceptance probability can be written as $\alpha(x; y) = \min\{r_{MH}(x; y),\, 1\}$, where $r_{MH}(x; y)$ is the Metropolis-Hastings ratio

$$r_{MH}(x; y) = \frac{\pi(y)\,Q(y; x)}{\pi(x)\,Q(x; y)}.$$

This shows that the algorithm will tend to accept moves from x to y either when:

- the target density is larger at y than at x, i.e., π(y) > π(x), or
- the proposal density favors transitions from y to x over transitions from x to y.

Moves of the first kind occur so that the Markov chain will spend a disproportionate amount of time in those parts of the state space where the target density is large. In contrast, moves of the second kind occur because the algorithm must compensate for the bias imposed by the proposal distribution.

Theorem: If X is the Markov chain described on the previous slide, then π is a stationary distribution of the chain X.

Proof: We need to show that if $X_n$ has density π(x), then so does $X_{n+1}$. We first observe that the transition function of this Markov chain is

$$P(x; dy) = \begin{cases} Q(x; y)\,\alpha(x; y)\,dy & \text{if } y \ne x, \\[4pt] \left(\int_E Q(x; z)\,(1 - \alpha(x; z))\,dz\right)\delta_x(dy) & \text{if } y = x, \end{cases}$$

where $\delta_x(dy)$ is the degenerate probability distribution that assigns probability 1 to the point x. The transition function has two parts because the Markov chain can either jump to the newly proposed state (top line) or it can reject the proposed state and remain in its current location (bottom line).

Next, for each $y \in E$, let

$$E_y = \{x \in E : r_{MH}(y; x) < 1\}, \qquad \bar{E}_y = \{x \in E : r_{MH}(x; y) < 1\},$$

and notice that since $r_{MH}(y; x) = 1/r_{MH}(x; y)$, we have $\bar{E}_y = (E_y)^c$. Then

$$\begin{aligned}
\int_E \pi(x)\,P(x; dy)\,dx &= \left(\int_{\bar{E}_y} \pi(x)\,Q(x; y)\,\frac{\pi(y)\,Q(y; x)}{\pi(x)\,Q(x; y)}\,dx + \int_{(\bar{E}_y)^c} \pi(x)\,Q(x; y)\,dx\right)dy \\
&\qquad + \pi(y)\,dy\int_{E_y} Q(y; z)\left(1 - \frac{\pi(z)\,Q(z; y)}{\pi(y)\,Q(y; z)}\right)dz \\
&= \left(\int_{(E_y)^c} \pi(y)\,Q(y; x)\,dx\right)dy + \left(\int_{E_y} \pi(x)\,Q(x; y)\,dx\right)dy \\
&\qquad + \left(\int_{E_y} \big(\pi(y)\,Q(y; z) - \pi(z)\,Q(z; y)\big)\,dz\right)dy \\
&= \left(\int_{(E_y)^c} \pi(y)\,Q(y; z)\,dz\right)dy + \left(\int_{E_y} \pi(y)\,Q(y; z)\,dz\right)dy \\
&= \pi(y)\left(\int_E Q(y; z)\,dz\right)dy = \pi(y)\,dy. \qquad \blacksquare
\end{aligned}$$

The Proposal Distribution

Although there is no universally valid rule to tell us how to choose the proposal distribution, the following considerations are important:

- Given any two sets A, B ⊂ E with π(A) > 0 and π(B) > 0, it should be possible to go from A to B in finitely many moves. If this condition is violated, then the chain will not converge to the stationary distribution.

- It should be easy to sample from Q(x; y).

- Q should be chosen so that the chain has good mixing properties, i.e., the chain should rapidly approach equilibrium when started from any x ∈ E:

$$P\{X_n \in dy \mid X_0 = x\} \to \pi(dy).$$

One practical guideline is that the acceptance probability α should be on the order of 0.2 to 0.6. It is often a good idea to try out different proposal distributions to identify one with good properties.

Initialization and Burn-In

Because the early values of an MCMC simulation are correlated with the initial value, these are usually not used to estimate expectations under the target distribution:

$$\hat{E}[f(X)] = \frac{1}{N - B}\sum_{n=B+1}^{N} f(X_n).$$

The initial steps $X_1, \dots, X_B$ are called the burn-in. It is common practice to take B to be approximately 10 to 20% of N, but it is even better to inspect the sample path of each simulation individually and then choose B large enough to omit the initial transient behavior of the chain. There is also a school of thought which holds that a burn-in period is unnecessary if the initial value can be selected from a region of the state space where the target density is relatively large.

Effective Sample Size

Because the N consecutive values generated by an MCMC simulation are correlated, they usually provide less accurate estimates of E[f(X)] than would N independent samples. One measure of the accuracy of this estimate is the effective sample size (ESS):

$$ESS = \frac{N}{1 + 2\sum_{k=1}^{\infty}\rho_k},$$

where $\rho_k$ is the k-step autocorrelation of $(f(X_n) : n \ge 1)$. The ESS can be estimated using the sample autocorrelation function. Convergence of the chain can also be assessed by visual inspection of the sample path of the chain.
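The ESS can be estimated from the sample autocorrelation function; a minimal sketch (mine, not from the slides) that truncates the sum at the first non-positive autocorrelation, one common heuristic:

```python
import numpy as np

def effective_sample_size(x):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    # Sample autocorrelation at lags 0..n-1 (acf[0] == 1 by construction).
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, n):
        if acf[k] <= 0.0:  # truncate once the autocorrelation dies out
            break
        rho_sum += acf[k]
    return n / (1.0 + 2.0 * rho_sum)

# Applied to the Metropolis-Hastings chain from the earlier sketch,
# after discarding a burn-in of B = 5000 samples:
# ess = effective_sample_size(chain[5000:])
```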