APM 541: Stochastic Modelling in Biology. Bayesian Inference. Jay Taylor (ASU), Fall 2013.


1 APM 541: Stochastic Modelling in Biology. Bayesian Inference. Jay Taylor, Fall 2013.

2 Outline.

1. Introduction
2. Conjugate Distributions
3. Noninformative Priors
4. MCMC

3 Introduction. Bayesian Statistics: Overview. Bayesian statistics is a collection of methods that rely on two basic principles:

1. Probability is interpreted as a measure of uncertainty, whatever the source. Thus, in a Bayesian analysis, it is standard practice to assign probability distributions not only to unseen data, but also to parameters, models, and hypotheses.
2. Bayes' formula is used to update the probability distributions representing uncertainty as new data are acquired. Recall that if A and B are two events with positive probabilities P(A) > 0 and P(B) > 0, then Bayes' formula asserts that

    P(B|A) = P(B) P(A|B) / P(A).

4 Introduction. Suppose that our goal is to use some newly acquired data to estimate the value of an unknown parameter Θ. A Bayesian treatment of this problem would consist of the following steps:

1. We first formulate a statistical model which specifies the conditional distribution of the data given each possible set of parameter values. This is specified by the likelihood function, p(D | Θ = θ).
2. We then choose a prior distribution, p(θ), for the unknown parameter, which quantifies how strongly we believe that Θ = θ before we examine the new data.
3. We then examine the new data D.
4. In light of this new data, we use Bayes' formula to revise our beliefs concerning the value of the parameter Θ:

    p(θ|D) = p(θ) p(D|θ) / p(D).

The quantity p(θ|D) is called the posterior distribution of the parameter θ given the data, and p(D) is sometimes called the marginal likelihood or the prior predictive probability.

5 Introduction. Example: Sex ratios (at birth) vary widely in the animal kingdom, and some species of invertebrates have extremely skewed sex ratios, with predominantly female broods. Suppose that we are studying a newly described species that we suspect of having a female-biased sex ratio. For the sake of the example, we will make the following simplifying assumptions:

- The probability of being born female does not vary within or between broods. In particular, there is no environmental or heritable variation in sex ratio.
- The sexes of the different members of a brood are determined independently of one another.

We will let Θ denote the unknown probability that an individual is born female. Conditional on Θ = θ, the sex ratio at birth will be θ/(1 − θ). To estimate the birth sex ratio we observe b broods and determine that the i-th brood contains f_i females and m_i males. Let n be the total number of offspring in all b broods and let f and m be the total numbers of females and males among them.

6 Introduction. We need to specify both the likelihood function and the prior distribution. Under our assumptions about sex determination, the likelihood function depends only on the sex ratio and the total numbers of males and females in the broods. Conditional on Θ = θ, the total number of female offspring among the n offspring is binomially distributed with parameters n and θ:

    P(f | Θ = θ) = (n choose f) θ^f (1 − θ)^(n−f).

We need to select a prior distribution on Θ, and this should reflect what we have previously learned about the birth sex ratio in this species. This could be determined by prior observations of this species or by information about the sex ratio in closely related species. Also, since the value of Θ is a probability, our prior distribution should be supported on the interval [0, 1]. Here I will let the prior distribution be the Beta distribution with parameters 2 and 1:

    p(θ) = 2θ.

This places greater weight on female-skewed sex ratios (θ > 1/2) without ruling out the possibility of an even sex ratio.

7 Introduction. We are now ready to use Bayes' formula to calculate the posterior distribution of Θ conditional on having observed f female offspring out of a total of n = f + m offspring:

    p(Θ = θ | f, m) = 2θ (n choose f) θ^f (1 − θ)^m / P(f, m) = (1/C) θ^(f+1) (1 − θ)^m,

where C is a normalizing constant that does not depend on θ. This we recognize as another Beta distribution, now with parameters f + 2 and m + 1, i.e.,

    p(Θ = θ | f, m) = θ^(f+1) (1 − θ)^m / β(f + 2, m + 1).

Remark: Notice that we did not have to evaluate the marginal likelihood P(f, m) in this example because it was absorbed into the normalizing constant.
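As a numerical check on this conjugate update, the following sketch (Python, standard library only, using hypothetical counts f = 7 and m = 3) forms the Beta(f + 2, m + 1) posterior from the Beta(2, 1) prior and reports its mean:

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

# Beta(2, 1) prior and hypothetical brood counts: f females, m males
a0, b0 = 2, 1
f, m = 7, 3

# Conjugate update: posterior is Beta(a0 + f, b0 + m) = Beta(9, 4)
a1, b1 = a0 + f, b0 + m

post_mean = a1 / (a1 + b1)   # mean of Beta(a, b) is a / (a + b)
print(a1, b1, round(post_mean, 3))
```

Here the posterior is Beta(9, 4), with mean 9/13 ≈ 0.692, matching the update rule on the slide.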

8 Introduction. It is often good practice to investigate the sensitivity of the posterior distribution to different choices of the prior distribution. For example, if we knew nothing about the birth sex ratios in the taxon containing our new species, then we might take the prior distribution to be uniform on [0, 1]. In this case, the posterior distribution of Θ would be proportional to the likelihood, giving

    p(Θ = θ | f, m) = (n choose f) θ^f (1 − θ)^m / P(f, m) = θ^f (1 − θ)^m / β(f + 1, m + 1),

i.e., the posterior distribution is the Beta distribution with parameters f + 1 and m + 1. Alternatively, if we have grounds to believe that the sex ratio is even, then we might take the prior distribution to be the Beta distribution with parameters 5 and 5, in which case the posterior distribution is still a Beta distribution, but with parameters f + 5 and m + 5:

    p(Θ = θ | f, m) = [θ^4 (1 − θ)^4 / β(5, 5)] (n choose f) θ^f (1 − θ)^m / P(f, m) = θ^(f+4) (1 − θ)^(m+4) / β(f + 5, m + 5).

9 Introduction. The figures show the posterior distributions corresponding to each of these three priors, Beta(2, 1), Beta(1, 1), and Beta(5, 5), for two different data sets: either f = 7, m = 3 (left plot) or f = 70, m = 30 (right plot). [Two plots of posterior density against θ.]

10 Introduction. Remarks:

- The posterior distribution depends on both the prior distribution and the data, but usually to different degrees. If there is little data, or if the data is not very informative about the unknown parameters, then the posterior distribution will be dominated by the prior distribution. Even if the data is both abundant and very informative, this will be reflected by the posterior distribution only if the prior distribution is sufficiently diffuse.
- Cromwell's rule: the prior distribution should only assign a probability of 0 to events that are logically false.
- The information content of the data can be evaluated by determining how different the posterior distribution is from the prior distribution. This is often done using the Kullback-Leibler divergence.

11 Introduction. Summaries of the Posterior Distribution. While the posterior distribution comprehensively represents the state of our knowledge concerning Θ once we have acquired the data, it is sometimes convenient to have simple numerical summaries of this distribution.

- Point estimators of Θ include the mean and the median of the posterior distribution.
- The maximum a posteriori (MAP) estimate of Θ is the value θ_MAP that maximizes the posterior density. This is comparable to the maximum likelihood estimate, but is less robust to approximation error.
- The spread of the posterior distribution is often summarized by the standard deviation or by the α/2 and 1 − α/2 quantiles (e.g., with α = 0.05).
- The posterior distribution can also be summarized by specifying the 100(1 − α)% highest probability density (HPD) region.

12 Introduction. The table lists summary statistics for the posterior distributions calculated in the sex ratio example.

    data         prior       posterior     mean    sd
    f=7,  m=3    Beta(2,1)   Beta(9,4)     0.692   0.123
    f=7,  m=3    Beta(1,1)   Beta(8,4)     0.667   0.131
    f=7,  m=3    Beta(5,5)   Beta(12,8)    0.600   0.107
    f=70, m=30   Beta(2,1)   Beta(72,31)   0.699   0.045
    f=70, m=30   Beta(1,1)   Beta(71,31)   0.696   0.045
    f=70, m=30   Beta(5,5)   Beta(75,35)   0.682   0.044

Interpretation: Whereas the small data set does not allow us to reject the hypothesis that the sex ratio is 1 : 1, all three analyses of the large data set suggest that the probability of being born female is greater than 1/2.
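The means and standard deviations in a table like this follow directly from the Beta posterior parameters. A short sketch, looping over the prior/data pairs of the example:

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

priors = [(2, 1), (1, 1), (5, 5)]
datasets = [(7, 3), (70, 30)]

for f, m in datasets:
    for a0, b0 in priors:
        mean, sd = beta_mean_sd(a0 + f, b0 + m)
        print(f"f={f:2d}, m={m:2d}, Beta({a0},{b0}) prior: "
              f"mean={mean:.3f}, sd={sd:.3f}")
```

Note how the standard deviation shrinks by roughly a factor of three when the sample size grows tenfold, while the prior's influence on the mean all but disappears.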

13 Introduction. Acquiring New Data. One of the strengths of Bayesian statistics is that it can easily handle sequentially acquired data.

Example: Suppose that in our first experiment we collected five broods totaling 7 females and 3 males. As we can see on the previous slide, all three posterior distributions are fairly broad and only weakly suggest that the sex ratio is female-biased. Suppose that we then conduct a second experiment, independent of the first, during which we observe 15 females and 5 males. We can again use Bayes' formula to account for this new data, but now we use the former posterior distribution as our new prior distribution:

    p(Θ = θ | f = 15, m = 5) = [θ^8 (1 − θ)^3 / β(9, 4)] (20 choose 15) θ^15 (1 − θ)^5 / P(f = 15, m = 5) = θ^23 (1 − θ)^8 / β(24, 9).

If we were to then conduct a third experiment, we would use the posterior distribution obtained from the second as the new prior, and so forth.
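The sequential scheme is easy to verify numerically: updating the Beta parameters one experiment at a time gives the same posterior as pooling all of the data into a single update. A sketch:

```python
# Sequential Bayesian updating with the conjugate Beta prior.
# Start from Beta(2, 1), update on two independent experiments, and
# compare against one update on the pooled counts.

a, b = 2, 1                          # Beta(2, 1) prior

experiments = [(7, 3), (15, 5)]      # (females, males) per experiment
for f, m in experiments:
    a, b = a + f, b + m              # posterior becomes the next prior

pooled = (2 + 22, 1 + 8)             # 22 females, 8 males in one batch

print((a, b), pooled)                # both are (24, 9), as on the slide
```

This order-independence is a consequence of the likelihood factorizing over independent experiments.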

14 Introduction. Criticisms. A number of objections have been raised against Bayesian methodology.

- Some critics object to the use of probabilities to quantify uncertainty in belief or plausibility. In particular, in the frequentist interpretation of probability, probabilities are only defined when they can be calculated as limiting frequencies in a sequence of independent trials. This stance largely rules out assigning probabilities to unknown parameters or models.
- Another criticism is that Bayesian methods are not objective because inference depends on a prior distribution, which can be chosen subjectively. A Bayesian rejoinder would be that the likelihood function, which is also used in frequentist analyses, is no less subjective than the prior.
- A third criticism, which is less relevant nowadays, is that Bayesian methods are computationally demanding. In the past this was often an insurmountable obstacle to conducting a Bayesian analysis.

15 Conjugate Distributions. Conjugate Prior Distributions. In the example we saw that both the prior distribution and the posterior distribution were beta distributions that differed only in the values of their parameters. Although the prior and posterior distribution will only belong to the same functional family under special circumstances, this situation is important enough in practice to have been studied in depth.

Definition. Suppose that F is a set of likelihood functions and that P is a set of prior distributions for θ. Then P is said to be conjugate for F if, whenever we choose a likelihood function p(x|θ) in F and a prior distribution p(θ) in P, the posterior distribution

    p(θ|x) = p(θ) p(x|θ) / p(x)

also belongs to P.

16 Conjugate Distributions. Since we can always render a prior distribution conjugate for any set of likelihood functions simply by taking P sufficiently large, the previous definition is almost too general to be useful. However, there is an important special case in which conjugacy becomes non-trivial.

Definition. A parametric family of distributions F on a set E ⊂ R^n is said to be an exponential family if the density of every member of the family has the form

    p(x|θ) = h(x) exp( Σ_{i=1}^m η_i(θ) t_i(x) − ψ(θ) ),  x ∈ E,

where

- θ = (θ_1, ..., θ_k) ∈ Ω ⊂ R^k is the vector-valued parameter;
- ψ : Ω → R is the log-partition function;
- h : E → [0, ∞);
- η = (η_1, ..., η_m) : Ω → R^m is the vector-valued natural parameter;
- T = (t_1, ..., t_m) : E → R^m is a vector of the same dimension as η.

T(x) is said to be a sufficient statistic for the distribution.

17 Conjugate Distributions. Many familiar parametric distributions belong to exponential families.

Example: The exponential distributions are an exponential family, since we can write the density as

    p(x|λ) = λ e^(−λx) = e^(−λx + ln(λ)),

where

- E = [0, ∞) and Ω = (0, ∞) with θ = λ;
- h(x) = 1 and ψ(λ) = −ln(λ);
- η_1(λ) = −λ and t_1(x) = x.

18 Conjugate Distributions. Example: The normal distributions are an exponential family, since we can write the density as

    p(x|μ, σ) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) = exp( (μ/σ²) x − (1/(2σ²)) x² − μ²/(2σ²) − (1/2) ln(2πσ²) ).

In this case,

- E = R;
- Ω = R × (0, ∞) with θ_1 = μ and θ_2 = σ;
- h(x) = 1;
- ψ(μ, σ) = μ²/(2σ²) + ln(2πσ²)/2;
- η_1(μ, σ) = μ/σ² and η_2(μ, σ) = −1/(2σ²);
- t_1(x) = x and t_2(x) = x².
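This factorization can be checked numerically by comparing the usual normal density with the exponential-family form built from η, T, and ψ above:

```python
import math

def normal_pdf(x, mu, sigma):
    """Standard form of the N(mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def exp_family_pdf(x, mu, sigma):
    """Exponential-family form h(x) exp(eta . T(x) - psi) with
    eta = (mu/sigma^2, -1/(2 sigma^2)), T(x) = (x, x^2), h(x) = 1."""
    eta1 = mu / sigma ** 2
    eta2 = -1 / (2 * sigma ** 2)
    psi = mu ** 2 / (2 * sigma ** 2) + 0.5 * math.log(2 * math.pi * sigma ** 2)
    return math.exp(eta1 * x + eta2 * x ** 2 - psi)

# The two forms agree at arbitrary points
for x in (-1.0, 0.3, 2.5):
    assert abs(normal_pdf(x, 1.0, 0.7) - exp_family_pdf(x, 1.0, 0.7)) < 1e-12
print("factorization verified")
```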

19 Conjugate Distributions. Example: For each fixed n ≥ 1, the binomial distributions are an exponential family:

    p_n(x|p) = (n choose x) p^x (1 − p)^(n−x) = (n choose x) exp( x ln(p/(1 − p)) + n ln(1 − p) ),

where

- E = {0, 1, ..., n} ⊂ R;
- Ω = [0, 1] with θ = p;
- h(x) = (n choose x) and ψ(p) = −n ln(1 − p);
- η_1(p) = ln(p/(1 − p)) and t_1(x) = x.

Remark: The binomial distributions do not form an exponential family if n and p are both allowed to vary. In effect, this is because there is no way to factor (n choose x) into a product of a function of n and a function of x.

20 Conjugate Distributions. Every Exponential Family Has a Parametric Conjugate Prior Distribution. Our interest in exponential families is motivated by the following observation. Suppose that the likelihood function belongs to an exponential family F with densities

    p(x|θ) = h(x) exp( Σ_{i=1}^m η_i(θ) t_i(x) − ψ(θ) ),

and let P be the set of prior distributions on θ with densities of the form

    p(θ; a_1, ..., a_m, d) = (1/C) exp( Σ_{i=1}^m a_i η_i(θ) − d ψ(θ) ),

where a_1, ..., a_m and d are constants and C = C(a_1, ..., a_m, d) is a normalizing constant. Notice that P is itself a parametric family with an (m + 1)-dimensional parameter.

21 Conjugate Distributions. Then the posterior distribution is

    p(θ|x) = (1/C) (h(x)/p(x)) exp( Σ_{i=1}^m η_i(θ)(a_i + t_i(x)) − (d + 1)ψ(θ) )
           = (1/C′) exp( Σ_{i=1}^m η_i(θ)(a_i + t_i(x)) − (d + 1)ψ(θ) )
           = p(θ; a_1 + t_1(x), ..., a_m + t_m(x), d + 1).

This shows that the posterior distribution is also a member of P, and therefore that P is conjugate for F. In this case, the posterior distribution can be calculated simply by updating the parameters of the prior distribution:

    a_i → a_i + t_i(x),  i = 1, ..., m,
    d → d + 1.

Because this updating usually requires much less computation than direct evaluation of the marginal likelihood of the data, p(x), conjugate priors are often used in problems where computational efficiency is at a premium.

22 Conjugate Distributions. Example: A conjugate prior for the exponential distribution, p(x|λ) = λe^(−λx), is

    p(λ; a, d) = (1/C) e^(−aλ + d ln(λ)) = (a^(d+1)/Γ(d + 1)) λ^d e^(−aλ),

which we recognize as the Gamma distribution with rate parameter a and shape parameter d + 1. If this is used as the prior distribution, then the posterior distribution given data x is

    p(λ|x) = ((a + x)^(d+2)/Γ(d + 2)) λ^(d+1) e^(−(a+x)λ),

which is a Gamma distribution with rate parameter a + x and shape parameter d + 2.

23 Conjugate Distributions. Example: A conjugate prior for the binomial distribution, p_n(x|p) = (n choose x) p^x (1 − p)^(n−x), with fixed n is

    p_n(p; a, d) = (1/C) exp( (a − 1) ln(p/(1 − p)) + dn ln(1 − p) ) = (1/C) p^(a−1) (1 − p)^(nd−a+1),

where C = C(a, d, n) is a normalizing constant. If we take d = (a + b − 2)/n, then this is just a Beta distribution with parameters a and b, and the posterior distribution is

    p_n(p|x) = p^(a+x−1) (1 − p)^(b+n−x−1) / β(a + x, b + n − x),

which is a Beta distribution with parameters a + x and b + n − x.

24 Conjugate Distributions. Example: A conjugate prior for the normal distribution,

    p(x|μ, σ²) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)),

is

    p(μ, σ²) = [ (b^a/Γ(a)) (σ²)^(−a−1) exp(−b/σ²) ] × [ (1/√(2πσ²)) exp(−(μ − m)²/(2σ²)) ].

This is a special case of the so-called normal-inverse-gamma distribution, denoted N-Γ⁻¹, with parameters m, 1, a, and b. If this is used as the prior distribution, then the posterior distribution given a single observation x also has the normal-inverse-gamma distribution, but now with parameters (m + x)/2, 2, a + 1/2, and b + (x − m)²/4.

25 Conjugate Distributions. Although the normal-inverse-gamma distribution appears to be complicated, it has a fairly straightforward interpretation. Suppose that the random vector (μ, σ²) is sampled in the following way:

    1/σ² ~ Gamma(a, b)
    μ | σ² ~ N(m, σ²/λ).

Then (μ, σ²) ~ N-Γ⁻¹(m, λ, a, b). In other words, we first sample the precision τ = 1/σ² from a Gamma distribution with parameters a and b. Then, having sampled σ², we sample μ from a normal distribution with mean m and variance σ²/λ.
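This two-stage recipe translates directly into code. A sketch, assuming b is the rate parameter of the Gamma distribution (so that `random.gammavariate`, which takes a scale, is called with 1/b):

```python
import math
import random

random.seed(0)

def sample_nig(m, lam, a, b):
    """Draw (mu, sigma^2) from N-Gamma^{-1}(m, lam, a, b):
    first tau = 1/sigma^2 ~ Gamma(shape a, rate b),
    then mu | sigma^2 ~ N(m, sigma^2 / lam)."""
    tau = random.gammavariate(a, 1.0 / b)     # gammavariate takes a scale
    sigma2 = 1.0 / tau
    mu = random.gauss(m, math.sqrt(sigma2 / lam))
    return mu, sigma2

draws = [sample_nig(0.0, 1.0, 3.0, 2.0) for _ in range(20000)]
avg_sigma2 = sum(s2 for _, s2 in draws) / len(draws)
print(round(avg_sigma2, 2))   # E[sigma^2] = b / (a - 1) = 1 for a = 3, b = 2
```

Averaging the σ² draws recovers the inverse-gamma mean b/(a − 1), which is a quick sanity check on the parameterization.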

26 Noninformative priors. Noninformative Priors. When there is little prior information concerning the value of an unknown parameter, some practitioners recommend using a so-called noninformative or reference prior distribution. Ostensibly, these distributions are chosen so that the posterior distribution depends only on the data, although whether this is ever really the case is debated. Several methods for selecting noninformative priors have been proposed:

- the principle of indifference;
- maximum entropy priors;
- Jeffreys priors;
- reference priors.

27 Noninformative priors. Principle of Indifference. Suppose that we wish to assign a prior distribution p to a finite set E = {e_1, ..., e_n} and that we have no reason to believe that any one of the outcomes is more likely than any of the others. In this case, the principle of indifference states that the uniform distribution on E should be chosen as the prior distribution, i.e.,

    p(e_i) = 1/n,  i = 1, ..., n.

Example: If we are asked to predict the outcome of a coin toss and we know nothing about the coin, then according to the principle of indifference our prior distribution should assign probability 1/2 to heads and probability 1/2 to tails.

28 Noninformative priors. Maximum Entropy Priors. There are several different concepts of entropy, but each of these is meant to quantify the amount of uncertainty in a distribution.

Definition. Suppose that P is a probability distribution on a set E.

1. If E = {e_1, e_2, ...} is discrete and P has probability mass function p(x), then the Shannon entropy of P is the quantity

    H(P) = − Σ_i p(e_i) log_b(p(e_i)),

where the base b is usually taken to be 2, e, or 10. Notice that changing b simply changes the units in which the entropy is expressed.

2. If E ⊂ R^n is an open set and P is a continuous distribution on E with density p(x), then the differential entropy of P is the quantity

    H(P) = − ∫_E p(x) log_b(p(x)) dx.
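For instance, the Shannon entropy of a few simple discrete distributions, computed directly from the definition (base 2, so the units are bits):

```python
import math

def shannon_entropy(p, base=2):
    """H(P) = -sum_i p_i log_b(p_i); terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(shannon_entropy([0.5, 0.5]))      # fair coin: 1 bit
print(shannon_entropy([0.25] * 4))      # uniform on 4 outcomes: 2 bits
print(shannon_entropy([1.0, 0.0]))      # deterministic outcome: 0 bits
```

The uniform distribution maximizes the entropy among distributions on a fixed finite set, which is the content of the discrete maximum entropy example below.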

29 Noninformative priors. One weakness of the previous definition (particularly in the case of the differential entropy) is that the entropies defined there depend on the parametrization of E. That is, if ψ : E → E′ is a one-to-one and onto function mapping E to E′, and P′ is the probability distribution induced on E′ by this transformation, then in general

    H(P) ≠ H(P′).

This is problematic because in some sense P and P′ are the same distribution: all that we have done is to relabel the points of E using the transformation ψ. Accordingly, if the entropy is meant to be a measure of the uncertainty associated with a distribution, then P and P′ should have the same entropy.

30 Noninformative priors. In light of this issue, it is often useful to associate a reference measure m with E, which is sometimes called the invariant measure of E. For example, if E possesses a natural set of symmetries (e.g., translations of R or rotations of S¹), then m might be chosen to be a measure that is invariant under these symmetry transformations. In this case, we have the following more general definition of entropy, due to Jaynes.

Definition. Suppose that E is equipped with an invariant measure m and that P is a probability distribution on E that has a density f with respect to m, i.e., for every subset A ⊂ E,

    P(A) = ∫_A f(x) m(dx).

Then the entropy of P (with respect to m) is the quantity

    H_c(P) = − ∫_E log(f(x)) P(dx).

It can be shown that H_c(P) is not changed by a reparametrization of the sample space.

31 Noninformative priors. The Maximum Entropy Principle.

Problem: Suppose that the set E is equipped with an invariant measure m and that our goal is to choose a prior distribution on E that is consistent with some body of prior information. Let P be the collection of probability distributions on E which have densities with respect to m and which are consistent with this information. How do we choose a prior distribution from this set?

Solution: According to the maximum entropy principle, we should select the distribution from P that has maximum entropy with respect to m, i.e., choose P_maxent such that

    H_c(P_maxent) ≥ H_c(P)  for all P ∈ P.

Rationale: We should choose a prior distribution that is consistent with our prior knowledge without introducing any hidden assumptions. According to the maximum entropy principle, the right way to do this is to choose the distribution which maximizes uncertainty subject to the prior constraints. This is a version of Occam's razor.

32 Noninformative priors. Maximum Entropy Distributions: Discrete Case. Suppose that E = {e_1, e_2, ...} is a discrete set and let m be a measure on E. We will assume that our prior knowledge is such that P is the set of all probability distributions on E such that

    Σ_{e ∈ E} f_k(e) p(e) = f̄_k,  k = 1, ..., r,

where each f_k : E → R is a real-valued function and f̄_k is a constant. In this case, the maximum entropy distribution will have the form

    p(e) = (1/Z) m(e) exp( Σ_{k=1}^r λ_k f_k(e) ),

where λ_1, ..., λ_r are Lagrange multipliers that must be chosen so that the r constraints are satisfied and Z is a normalizing constant. It can be shown that this distribution is a member of an exponential family and also a Gibbs measure on E.

33 Noninformative priors. Example: Suppose that:

- E = {e_1, ..., e_n} is finite;
- the invariant measure is given by a function m : E → [0, ∞) with positive total mass;
- we have no prior knowledge, and thus P is simply the collection of all probability distributions on E.

Because there are no constraints (besides normalization), there are no Lagrange multipliers to be solved for in this case, and so the maximum entropy distribution is simply

    p(e_i) = m(e_i) / Σ_{j=1}^n m(e_j).

In particular, if m is constant, then the maximum entropy distribution is just the uniform distribution on E, in which case the maximum entropy principle and the principle of indifference lead to the same choice of prior distribution.

34 Noninformative priors. Maximum Entropy Distributions: Continuous Case. Suppose that E = (a, b) ⊂ R is an interval and let m be a measure on E. We will assume that P is the set of probability distributions on E which have densities with respect to m and which satisfy the constraints

    ∫_E f_k(x) p(x) m(dx) = f̄_k,  k = 1, ..., r,

where each f_k : E → R is a real-valued function and f̄_k is a constant. In this case, the maximum entropy distribution will have a density (with respect to m) of the form

    p(x) = (1/Z) exp( Σ_{k=1}^r λ_k f_k(x) ),

where the constants λ_1, ..., λ_r must be chosen so that the r constraints are satisfied and Z is a normalizing constant. As in the discrete case, this distribution is both a member of an exponential family and a Gibbs measure on E.

35 Noninformative priors. Examples:

- If E = [a, b] is compact, m(dx) = dx, and there are no prior constraints, then the maximum entropy distribution on E is the uniform distribution.
- If E = [0, ∞), m(dx) = dx, and P is the collection of continuous distributions on E with mean μ > 0, then the maximum entropy distribution in this set is the exponential distribution with rate parameter 1/μ.
- If E = R, m(dx) = dx, and P is the collection of continuous distributions on E with mean μ and variance σ² < ∞, then the maximum entropy distribution in this set is the normal distribution with this mean and variance.
- If E = R, m(dx) = dx, and P is the collection of continuous distributions on E with mean μ, then there is no maximum entropy distribution. Indeed, since the entropy of the normal distribution with mean μ and variance σ² is ln(2πeσ²)/2, we can construct a sequence of distributions in P with entropy tending to ∞.

36 Noninformative priors. Jeffreys Priors. As the last example illustrates, not every class of probability distributions contains a maximum entropy distribution. This is especially likely if there are no prior constraints on that class.

In lieu of the maximum entropy principle, we desire a rule that chooses an uninformative prior in a manner that is invariant under reparametrization. That is, if we have a procedure Π that chooses a distribution Π_E on each space E, then for any bijection ψ : E → E′ we want the following identity to hold:

    Π_ψ(E) = Π_E ∘ ψ⁻¹,

i.e., we should arrive at the same distribution whether we first choose the prior distribution and then reparametrize, or first reparametrize and then select the prior distribution. It turns out that there is essentially a unique solution to this problem, which was described by Harold Jeffreys and leads to a distribution called the Jeffreys prior.

37 Noninformative priors. Before we can describe the Jeffreys prior, we need to introduce the Fisher information of a parametric family of distributions.

Definition. Suppose that p(x; θ) is a probability distribution on a set E that depends smoothly on a parameter θ ∈ Ω ⊂ R^d. Then the Fisher information of p is the d × d matrix I(θ) with entries

    (I(θ))_{i,j} = E[ (∂/∂θ_i) log p(X; θ) · (∂/∂θ_j) log p(X; θ) ],

where X is a random variable with the distribution p(x; θ).

Interpretation: The Fisher information measures the expected curvature of the log-likelihood of the data with respect to the parameter θ. Its components measure how much information the data contains, on average, about the values of the components of the parameter.

38 Noninformative priors. Now suppose that F is the collection of likelihood functions of the form p(x; θ), where θ ∈ Ω ⊂ R^d and, as on the previous slide, we assume that p is a smooth function of the parameter. Then the Jeffreys prior distribution on θ is the distribution on Ω that is proportional to the square root of the determinant of the Fisher information:

    p(θ) ∝ √(det I(θ)).

It can be shown that this procedure is invariant under reparametrizations: if we transform θ to φ = φ(θ), then the Jeffreys prior for φ will be the same distribution as the transformation of the Jeffreys prior for θ.

39 Noninformative priors. Example: Suppose that the likelihood function is given by the binomial distribution with fixed n and variable θ:

    p(x; θ) = (n choose x) θ^x (1 − θ)^(n−x).

In this case, the log-likelihood function is

    log p(x; θ) = C + x log(θ) + (n − x) log(1 − θ),

where C is a constant that does not depend on θ. Its derivative with respect to θ is

    (d/dθ) log p(x; θ) = x/θ − (n − x)/(1 − θ),

and consequently the Fisher information is

    I(θ) = E[ (X/θ − (n − X)/(1 − θ))² ] = n / (θ(1 − θ)).

40 Noninformative priors. It follows that for a binomial likelihood function, with fixed sample size n and unknown success probability θ, the Jeffreys prior on θ is

    p(θ) = (1/C) √(I(θ)) = 1 / (π √(θ(1 − θ))),

which is just the Beta distribution with parameters 1/2 and 1/2. Notably, this is not the maximum entropy distribution on [0, 1] (which is just the standard uniform distribution), which demonstrates that these two principles can lead to different choices of uninformative prior distribution even in relatively simple settings.
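The closed form I(θ) = n/(θ(1 − θ)) can be confirmed by computing the expected squared score directly, summing over the binomial support:

```python
import math

def fisher_information_binomial(n, theta):
    """E[(d/dtheta log p(X; theta))^2] for X ~ Binomial(n, theta),
    computed by summing over the support x = 0, ..., n."""
    total = 0.0
    for x in range(n + 1):
        prob = math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        score = x / theta - (n - x) / (1 - theta)
        total += prob * score ** 2
    return total

n, theta = 10, 0.3
numeric = fisher_information_binomial(n, theta)
exact = n / (theta * (1 - theta))     # the closed form derived above

print(round(numeric, 4), round(exact, 4))
```

The Jeffreys prior density is then proportional to the square root of this quantity as a function of θ.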

41 Noninformative priors. Examples:

- If the likelihood is normal with unknown mean μ and known variance σ², then the Fisher information is I(μ) = 1/σ², while the Jeffreys prior for μ is p(μ) ∝ 1, which is just Lebesgue measure on R. This is not normalizable and is therefore said to be an improper prior. In this case, the use of this improper prior still leads to a proper posterior distribution, but in general one must be very cautious about the use of such prior distributions.
- If the likelihood is normal with known mean μ and unknown standard deviation σ, then the Fisher information is I(σ) = 2/σ², while the Jeffreys prior for σ is p(σ) ∝ 1/σ for σ > 0. Again, this is an improper prior distribution on (0, ∞).

42 MCMC. Markov Chain Monte Carlo Methods. Until recently, Bayesian methods were regarded as impractical for many problems because of the computational difficulty of evaluating the posterior distribution. In particular, to use Bayes' formula,

    p(θ|D) = p(θ) p(D|θ) / p(D),

we need to evaluate the marginal probability of the data,

    p(D) = ∫ p(D|θ) p(θ) dθ,

which requires integration of the likelihood function over the parameter space. Except in special cases, this integration must be performed numerically, and often it is over a space with dozens or hundreds of dimensions.

43 MCMC. An alternative to direct evaluation of Bayes' formula is to use Monte Carlo methods to sample from the posterior distribution. Ideally, we would like to generate a collection of independent samples, Θ_1, ..., Θ_N, with Θ_i ~ p(θ|D), which can then be used to estimate expectations and probabilities under p(θ|D):

    ∫ f(θ) p(θ|D) dθ ≈ (1/N) Σ_{i=1}^N f(Θ_i),
    ∫_A p(θ|D) dθ ≈ (1/N) Σ_{i=1}^N 1_A(Θ_i).

In effect, we are using the empirical distribution of the sample to estimate p(θ|D):

    p(θ|D) ≈ (1/N) Σ_{i=1}^N δ_{Θ_i}.
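With independent draws available, the sample averages above are straightforward. A sketch using the Beta(9, 4) posterior of the sex-ratio example and the standard library's betavariate:

```python
import random

random.seed(42)

# Independent draws from the Beta(9, 4) posterior of the sex-ratio example
N = 100_000
samples = [random.betavariate(9, 4) for _ in range(N)]

est_mean = sum(samples) / N                        # approximates E[Theta | D]
est_prob = sum(th > 0.5 for th in samples) / N     # approximates P(Theta > 0.5 | D)

print(round(est_mean, 3), round(est_prob, 3))      # exact mean is 9/13 ~ 0.692
```

In realistic models no direct sampler like betavariate exists for the posterior, which is exactly the gap that the MCMC methods on the following slides fill.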

44 MCMC. One way to sample from a given distribution is to construct a Markov chain that has this distribution as its stationary distribution. This idea leads to a set of sampling techniques collectively known as Markov chain Monte Carlo (MCMC) methods.

Suppose that we can formulate a discrete-time Markov chain (Θ_t : t ≥ 0) which has p(θ|D) as a stationary distribution. If the chain is ergodic, then whatever the initial value Θ_0, the distribution of Θ_t will tend to p(θ|D). Furthermore, by sampling the chain at sufficiently widely spaced intervals, say Θ_T, Θ_2T, Θ_3T, ..., we will obtain a collection of samples that are approximately independent.

Formulations include the Metropolis-Hastings algorithm and the Gibbs sampler.

45 MCMC. The Metropolis-Hastings Algorithm. The Metropolis-Hastings algorithm provides a recipe for constructing a Markov chain with a specified stationary distribution on a state space E. It requires the following ingredients:

- a target distribution with density π(x) on E;
- a stochastic kernel Q(x; y) on E × E: in other words, for each x ∈ E, Q(x; y) is the density of a probability distribution on E, called the proposal density;
- a sequence of i.i.d. standard uniform random variables U_1, U_2, ....

The aim is to construct a Markov chain X = (X_t : t ≥ 0) with stationary distribution π.

Remark: If we need to sample discrete random variables, then we can interpret each of the above probability densities as a probability mass function. (More broadly, we can assume that all of the densities are defined with respect to a common reference measure on E.)

46 MCMC. The Metropolis-Hastings algorithm consists of repeated application of the following three steps. Suppose that X_n = x is the current state of the Markov chain. Then the next state is chosen as follows:

Step 1: We first propose a new value for the chain by sampling Y_{n+1} = y from the density Q(x; y).

Step 2: We then calculate the acceptance probability of the new state:

    α(x; y) = min{ π(y) Q(y; x) / (π(x) Q(x; y)), 1 }.

Step 3: If U_{n+1} ≤ α(x; y), then we set X_{n+1} = Y_{n+1}. Otherwise, we set X_{n+1} = X_n.
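The three steps above can be sketched directly. This example targets the unnormalized Beta(9, 4) posterior from earlier in the lecture, with a symmetric random-walk proposal, so Q(x; y) = Q(y; x) and the acceptance ratio reduces to π(y)/π(x):

```python
import random

random.seed(0)

def target(theta):
    """Unnormalized Beta(9, 4) density: theta^8 (1 - theta)^3 on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta ** 8 * (1 - theta) ** 3

def metropolis_hastings(n_steps, step=0.1, x0=0.5):
    """Random-walk Metropolis: the symmetric proposal cancels in r_MH."""
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + random.uniform(-step, step)        # Step 1: propose
        alpha = min(target(y) / target(x), 1.0)    # Step 2: acceptance probability
        if random.random() < alpha:                # Step 3: accept or reject
            x = y
        chain.append(x)
    return chain

chain = metropolis_hastings(50_000)
posterior_mean = sum(chain[5_000:]) / len(chain[5_000:])   # discard burn-in
print(round(posterior_mean, 2))    # should settle near 9/13 ~ 0.69
```

Note that the normalizing constant β(9, 4) is never needed, which is precisely why MCMC sidesteps the marginal likelihood p(D).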

47 MCMC Interpretation: Notice that the acceptance probability can be written as α(x; y) = min{r_MH(x; y), 1}, where r_MH(x; y) is the Metropolis-Hastings ratio

r_MH(x; y) = π(y)Q(y; x) / (π(x)Q(x; y)).

This shows that the algorithm will tend to accept moves from x to y when either: the target density is larger at y than at x, i.e., π(y) > π(x); or the proposal density favors the reverse transition from y to x over the forward transition from x to y. Moves of the first kind occur so that the Markov chain spends a disproportionate amount of time in those parts of the state space where the target density is large. In contrast, moves of the second kind occur because the algorithm must compensate for the bias imposed by the proposal distribution.

48 MCMC Theorem: If X is the Markov chain described on the previous slide, then π is a stationary distribution of the chain X.

Proof: We need to show that if X_n has density π, then so does X_{n+1}. We first observe that the transition function of this Markov chain is

P(x; dy) = Q(x; y) α(x; y) dy,  if y ≠ x,
P(x; dy) = ( ∫_E Q(x; z)(1 − α(x; z)) dz ) δ_x(dy),  if y = x,

where δ_x(dy) is the degenerate probability distribution that assigns probability 1 to the point x. The transition function has two parts because the Markov chain can either jump to the newly proposed state (first line) or reject the proposed state and remain at its current location (second line).

49 MCMC Next, for each y ∈ E, let

E'_y = {x ∈ E : r_MH(y; x) < 1} and E_y = {x ∈ E : r_MH(x; y) < 1},

and notice that since r_MH(y; x) = 1/r_MH(x; y), we have E'_y = (E_y)^c. On E_y the acceptance probability is α(x; y) = r_MH(x; y), while on (E_y)^c it equals 1. Then

∫_E π(x) P(x; dy) dx
= ( ∫_{E_y} π(x)Q(x; y) · [π(y)Q(y; x) / (π(x)Q(x; y))] dx + ∫_{(E_y)^c} π(x)Q(x; y) dx ) dy
  + π(y) ( ∫_{(E_y)^c} Q(y; z) [1 − π(z)Q(z; y) / (π(y)Q(y; z))] dz ) dy

50 MCMC
= ( ∫_{E_y} π(y)Q(y; x) dx + ∫_{(E_y)^c} π(x)Q(x; y) dx + ∫_{(E_y)^c} [π(y)Q(y; z) − π(z)Q(z; y)] dz ) dy
= ( ∫_{E_y} π(y)Q(y; x) dx + ∫_{(E_y)^c} π(y)Q(y; z) dz ) dy
= π(y) ( ∫_E Q(y; z) dz ) dy
= π(y) dy.

51 MCMC The Proposal Distribution Although there is no universally valid rule for choosing the proposal distribution, the following considerations are important: Given any two sets A, B ⊂ E with π(A) > 0 and π(B) > 0, it should be possible to move from A to B in finitely many steps; if this condition is violated, then the chain will not converge to the stationary distribution from every starting point. It should be easy to sample from Q(x; ·). Q should be chosen so that the chain has good mixing properties, i.e., the chain should rapidly approach equilibrium when started from any x ∈ E: P{X_n ∈ dy | X_0 = x} → π(dy). One practical guideline is to tune the proposal so that the acceptance probability α is moderate, neither close to 0 nor close to 1. It is often a good idea to try out different proposal distributions to identify one with good properties.
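The trade-off behind this tuning guideline can be checked empirically. The sketch below (an illustration under the same assumptions as before: a standard normal target and Gaussian random-walk proposals) measures the acceptance rate for several proposal widths: tiny steps are almost always accepted but explore the state space slowly, while very large steps are almost always rejected.

```python
import math
import random

def acceptance_rate(step, n_steps=5000, seed=1):
    """Fraction of accepted random-walk MH proposals for a N(0,1) target."""
    rng = random.Random(seed)
    log_pi = lambda z: -0.5 * z * z
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)          # propose with width `step`
        log_alpha = log_pi(y) - log_pi(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x, accepted = y, accepted + 1     # count accepted moves
    return accepted / n_steps

# Small steps -> high acceptance but slow exploration;
# large steps -> bold proposals that are almost always rejected.
rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
```

A proposal width comparable to the scale of the target (here, step ≈ 1) gives an intermediate acceptance rate and typically the best mixing.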

52 MCMC Initialization and Burn-In Because the early values of an MCMC simulation are correlated with the initial value, these are usually discarded when estimating expectations under the target distribution:

E[f(X)] ≈ (1/(N − B)) Σ_{n=B+1}^{N} f(X_n).

The initial steps X_1, ..., X_B are called the burn-in. It is common practice to take B to be approximately 10-20% of N, but it is even better to inspect the sample path of each simulation individually and then choose B large enough to omit the initial transient behavior of the chain. There is also a school of thought which holds that a burn-in period is unnecessary if the initial value can be selected from a region of the state space where the target density is relatively large.
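The burn-in estimator above can be sketched as follows. This is an illustration, not part of the slides: the chain is the same toy random-walk sampler for a standard normal target, deliberately started far from the mode so that the initial transient is visible, and the first 10% of the run is discarded before averaging.

```python
import math
import random

def run_chain(n_steps, x0, step=1.0, seed=2):
    """Random-walk MH chain targeting N(0,1), possibly started far from the mode."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)
        log_alpha = -0.5 * (y * y - x * x)    # log pi(y) - log pi(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = y
        out.append(x)
    return out

N = 10000
chain = run_chain(N, x0=20.0)         # deliberately bad starting point
B = N // 10                           # discard ~10% of the run as burn-in
estimate = sum(chain[B:]) / (N - B)   # (1/(N-B)) * sum_{n=B+1}^{N} f(X_n), f = identity
naive = sum(chain) / N                # includes the initial transient
```

Plotting the sample path of such a chain shows the transient decay from x0 toward the high-density region, which is the visual inspection the slide recommends for choosing B.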

53 MCMC Effective Sample Size Because the N consecutive values generated by an MCMC simulation are correlated, they usually provide a less accurate estimate of E[f(X)] than would N independent samples. One measure of the accuracy of this estimate is the effective sample size (ESS):

ESS = N / (1 + 2 Σ_{k=1}^∞ ρ_k),

where ρ_k is the k-step autocorrelation of the sequence (f(X_n) : n ≥ 1). The ESS can be estimated using the sample autocorrelation function. Convergence of the chain can also be assessed by visual inspection of its sample path.
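The ESS formula can be estimated from a sample path as sketched below. This is a minimal illustration with two assumptions not in the slides: the infinite sum over ρ_k is truncated at the first non-positive sample autocorrelation (a common heuristic), and the correlated series is an AR(1) process standing in for MCMC output.

```python
import random

def ess(xs):
    """Effective sample size via ESS = N / (1 + 2 * sum_k rho_k),
    truncating the sum at the first non-positive sample autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    acf_sum = 0.0
    for k in range(1, n):
        # k-step sample autocorrelation rho_k
        rho = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / (n * var)
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

rng = random.Random(3)
iid = [rng.gauss(0, 1) for _ in range(2000)]   # independent draws: ESS ~ N

# An AR(1) series with strong positive autocorrelation, mimicking MCMC output:
x, ar = 0.0, []
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0, 1)
    ar.append(x)

ess_iid = ess(iid)   # close to 2000
ess_ar = ess(ar)     # far smaller: correlated draws carry less information
```

For an AR(1) series with coefficient 0.9 the autocorrelations are ρ_k = 0.9^k, so the theoretical ESS is N/(1 + 2 · 0.9/0.1) = N/19, roughly a twentieth of the nominal sample size.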


More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

An overview of Bayesian analysis

An overview of Bayesian analysis An overview of Bayesian analysis Benjamin Letham Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA bletham@mit.edu May 2012 1 Introduction and Notation This work provides

More information

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016 Auxiliary material: Exponential Family of Distributions Timo Koski Second Quarter 2016 Exponential Families The family of distributions with densities (w.r.t. to a σ-finite measure µ) on X defined by R(θ)

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition. Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface

More information

Chapter 5. Bayesian Statistics

Chapter 5. Bayesian Statistics Chapter 5. Bayesian Statistics Principles of Bayesian Statistics Anything unknown is given a probability distribution, representing degrees of belief [subjective probability]. Degrees of belief [subjective

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Bayesian Inference: Posterior Intervals

Bayesian Inference: Posterior Intervals Bayesian Inference: Posterior Intervals Simple values like the posterior mean E[θ X] and posterior variance var[θ X] can be useful in learning about θ. Quantiles of π(θ X) (especially the posterior median)

More information