STAT 425: Introduction to Bayesian Analysis
Bayesian Analysis (Part 1)
Marina Vannucci, Rice University, USA
Fall 2017
Lecture 7: Prior Types

- Subjective and objective priors
- Non-informative priors
- Jeffreys priors
- Diffuse priors
- Bayes factors for hypothesis testing
Choice of Prior Distribution

What is the interpretation of probability?

- Objective probability: a normative, objective representation of what it is rational to believe about a parameter, usually in a state of ignorance (dice, coins, etc.).
- Frequentist probability: the long-run frequency when a process is repeated.
- Subjective probability: a personal judgment about an event; one must, however, be coherent.

Two general types of priors:

- Subjective priors: chosen to reflect expert opinion or personal beliefs.
- Objective priors: chosen to let the data (i.e., the likelihood) dominate the posterior distribution, and hence the inference. These are generally determined by the sampling model in use.
Subjective Priors

Ways of specifying:

- Single events: state the probability of A relative to not-A (as in lotteries/betting).
- Continuous quantities: divide the range into intervals and assign a probability to each (as for events above), or specify percentiles; then fit a smooth probability density.

Examples:
(i) theta ~ Beta(alpha, beta). From an expert: mean = 0.4 and s.d. = 0.1. Solving for alpha and beta gives Beta(9.2, 13.8).
(ii) theta ~ N(mu, sigma^2). From an expert: an educated guess of 25 and 95% confidence that theta lies between 10 and 40. Then theta ~ N(25, (30/4)^2), i.e., s.d. = 7.5.

Subjective priors may be classified into two classes:

- Conjugate priors
- Non-conjugate priors
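The moment-matching step in example (i) can be sketched numerically. The helper names below are hypothetical, not from the slides: the first solves the Beta mean/variance equations m = alpha/(alpha+beta) and s^2 = alpha*beta/[(alpha+beta)^2 (alpha+beta+1)]; the second applies the rough "+/- 2 s.d." rule used in example (ii).

```python
def beta_from_mean_sd(m, s):
    """Solve for (alpha, beta) of a Beta prior with mean m and s.d. s."""
    k = m * (1 - m) / s**2 - 1      # k = alpha + beta, from the variance equation
    return m * k, (1 - m) * k

alpha, beta = beta_from_mean_sd(0.4, 0.1)   # gives (9.2, 13.8), as on the slide

def normal_sd_from_interval(lo, hi):
    """s.d. from a central ~95% range, using the rough +/- 2 s.d. rule."""
    return (hi - lo) / 4

sd = normal_sd_from_interval(10, 40)        # gives 7.5 = 30/4
```

The same calculation works for any elicited mean and spread, which is why moment matching is the usual first pass at a subjective conjugate prior.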
Non-informative or Objective Priors

Goal: choose priors that reflect a state of no information about theta.

The central problem is specifying a prior distribution for a parameter about which nothing is known. For example, if theta can take only a finite set of values, it seems natural to assume all values are equally likely a priori. Such specifications fall into the general class of non-informative priors.

Roughly speaking, a prior is non-informative if it has minimal impact on the posterior, i.e., it does not change much over the region where the likelihood is appreciable. This notion was first used by Bayes (1763) and by Laplace (1812) in astronomy. A formal development is given in Box and Tiao (1973) and in Kass and Wasserman (1996, JASA). Still, the notion is not adequately well defined.

Two types:

- Reference or default priors
- Diffuse priors
Reference Priors

Example: x = (x_1, ..., x_n) from N(theta, sigma^2), theta unknown, sigma known. What prior reflects a state of no information about theta and lets the likelihood dominate the posterior? The flat prior: pi(theta) = 1 for -inf < theta < inf.

Some features of the flat prior:

- Non-informative: all values of theta are equally likely; no information.
- The likelihood dominates the posterior: p(theta | x) ∝ L(theta) pi(theta) = L(theta).
- It may not be a probability distribution, i.e., ∫ pi(theta) dtheta = inf (an "improper prior"). This is not a major concern if the posterior is proper.

Example: X ~ Poisson(lambda), 0 < lambda < inf, lambda ~ Ga(alpha, beta). Taking alpha = 1, beta -> 0 gives pi(lambda) ∝ 1, but with a proper posterior lambda | x ~ Ga(alpha + t, n), where t = sum of the x_i (similarly for the normal example above). In complex models, however, the posterior may also be improper, and then no inference can be made.

Other default priors: pi(lambda) ∝ 1/lambda^{1/2} [i.e., lambda ~ Ga(1/2, beta -> 0)] and pi(lambda) ∝ 1/lambda [i.e., lambda ~ Ga(0, beta -> 0)] (both with proper posteriors).
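As a sanity check on the Poisson example, the kernel lambda^t exp(-n*lambda) implied by the flat prior can be normalized numerically and compared with the stated Ga(1 + t, n) posterior. The counts below are made up for illustration:

```python
import math

data = [3, 1, 4, 2, 5, 0, 2, 3]          # hypothetical Poisson counts
n, t = len(data), sum(data)              # here n = 8, t = 20

# Flat prior pi(lambda) = 1 gives posterior kernel lambda^t * exp(-n*lambda),
# the kernel of a Ga(1 + t, n) density, which is proper.
step = 0.001
grid = [i * step for i in range(1, 20000)]
kernel = [lam**t * math.exp(-n * lam) for lam in grid]
Z = sum(kernel) * step                   # finite => posterior is proper
post_mean = sum(l * k for l, k in zip(grid, kernel)) * step / Z
# post_mean agrees with the Ga(1 + t, n) mean, (1 + t)/n = 21/8
```

The finite normalizing constant Z is exactly what "improper prior, proper posterior" means in this example.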
Change of Variable

Another concern with reference priors: the constant/flat prior rule is not transformation invariant. If tau = g(theta) is a monotone function of theta with inverse theta = g^{-1}(tau), then the density of tau is obtained from the density of theta as

p(tau) = p(g^{-1}(tau)) |d g^{-1}(tau) / d tau|,

with Jacobian J = d g^{-1}(tau) / d tau.

Example: pi(theta) = 1 and tau = e^theta. The implied prior is pi(tau) = 1/tau, no longer uniform.

Binomial model:

- theta ~ Beta(1, 1): uniform for theta, 0 <= theta <= 1
- theta ~ Beta(0, 0): uniform for the log odds, tau = log[theta/(1 - theta)]
- theta ~ Beta(1, -1), i.e., pi(theta) ∝ (1 - theta)^{-2}: uniform for the odds, tau = theta/(1 - theta)

A density cannot be simultaneously uniform in all transformations. This is the problem with the uniform distribution as a non-informative prior: no information on a transformation of theta does not imply no information on theta.

Remedy: Jeffreys priors.
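The non-invariance is easy to see by simulation. Drawing theta uniformly and mapping to the log odds gives a density peaked at tau = 0 (the logistic density e^tau/(1 + e^tau)^2), not a flat one. This is a small illustrative check, not from the slides:

```python
import math
import random

random.seed(1)
taus = [math.log(th / (1 - th))
        for th in (random.random() for _ in range(100_000))]

# If tau were uniform, equal-width windows would hold similar counts.
near_zero = sum(1 for t in taus if -0.25 < t < 0.25)
near_three = sum(1 for t in taus if 2.75 < t < 3.25)
# near_zero is several times larger than near_three
```

So a "no information" uniform on theta is quite informative on the log-odds scale, which is exactly the complaint above.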
Jeffreys Priors

Let p(x | theta) be a (single-parameter) probability model. The Fisher information is defined as

I(theta) = -E_{x|theta}[ d^2 log p(x | theta) / d theta^2 ],

and the Jeffreys prior is pi(theta) ∝ I(theta)^{1/2}.

Properties:

- Locally uniform, i.e., flat in the region of the likelihood.
- Transformation invariant, i.e., all parameterizations lead to the same prior: with pi(theta) ∝ I(theta)^{1/2} and pi(tau) ∝ I(tau)^{1/2}, tau = g(theta), the same prior is obtained whether one (i) applies the Jeffreys rule to get pi(theta) and then transforms to get pi(tau), or (ii) transforms to tau and then applies the Jeffreys rule to get pi(tau).

Caution:

- The posterior may not be proper.
- It does not work well for multi-parameter models theta = (theta_1, ..., theta_k), where I(theta) is the Fisher information matrix of second partial derivatives and pi(theta) ∝ |I(theta)|^{1/2}.
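The definition can be checked numerically. The sketch below (with made-up n and theta) takes the exact expectation over x of a central-difference second derivative of the Binomial log-likelihood and compares it with the closed form I(theta) = n/[theta(1 - theta)] derived in the examples:

```python
from math import comb, log

n, theta, h = 10, 0.3, 1e-4              # hypothetical model settings

def loglik(x, th):
    # Binomial(n, th) log-likelihood at observation x
    return log(comb(n, x)) + x * log(th) + (n - x) * log(1 - th)

def d2_loglik(x, th):
    # central-difference second derivative in theta
    return (loglik(x, th + h) - 2 * loglik(x, th) + loglik(x, th - h)) / h**2

pmf = [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]
info = -sum(p * d2_loglik(x, theta) for x, p in enumerate(pmf))
# info is close to n / (theta * (1 - theta)), about 47.62 here
jeffreys_kernel = info**0.5              # pi(theta) is proportional to this
```

Repeating this at several theta values traces out the Beta(1/2, 1/2) kernel up to a constant.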
Examples

Binomial: x ~ Binomial(n, theta), with
log p(x | theta) = x log(theta) + (n - x) log(1 - theta) + const, and E[x | theta] = n theta.
d^2 log p(x | theta) / d theta^2 = -x/theta^2 - (n - x)/(1 - theta)^2,
I(theta) = n theta / theta^2 + (n - n theta)/(1 - theta)^2 = n / [theta(1 - theta)].
The Jeffreys prior is pi(theta) ∝ theta^{-1/2} (1 - theta)^{-1/2}, i.e., theta ~ Beta(1/2, 1/2) (conjugate, proper).

Poisson: x ~ Poisson(lambda), with
log p(x | lambda) = -lambda + x log(lambda) - log(x!) (single observation), and E[x | lambda] = lambda.
d^2 log p(x | lambda) / d lambda^2 = -x/lambda^2, I(lambda) = 1/lambda.
The Jeffreys prior is pi(lambda) ∝ lambda^{-1/2}, i.e., lambda ~ Ga(1/2, beta -> 0) (conjugate, improper, with a proper posterior).
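Because both Jeffreys priors are conjugate, posterior updating reduces to adding the data to the prior parameters. The data values below are hypothetical, chosen only to show the update rules:

```python
# Binomial: Beta(1/2, 1/2) Jeffreys prior, x successes in n trials
x, n = 7, 20                                  # hypothetical data
a_post, b_post = 0.5 + x, 0.5 + (n - x)       # posterior Beta(x + 1/2, n - x + 1/2)
binom_post_mean = a_post / (a_post + b_post)  # 7.5 / 21

# Poisson: pi(lambda) prop. to lambda^(-1/2), i.e. Ga(1/2, beta -> 0)
counts = [2, 0, 3, 1, 4]                      # hypothetical counts
shape, rate = 0.5 + sum(counts), len(counts)  # posterior Ga(1/2 + sum x_i, n)
pois_post_mean = shape / rate                 # 10.5 / 5
```

Note that in both cases the improper or boundary prior still yields a proper, standard-family posterior, as claimed above.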
Diffuse Priors

The motivation is to use a proper prior carrying little information about the parameter. Conjugate priors are proper, and the spread of a prior measures the amount of uncertainty about the parameter that the prior expresses. So, choose a conjugate prior with a large standard deviation as a diffuse or weakly informative prior.

Examples:
(i) X ~ Poisson(lambda), 0 < lambda < inf, lambda ~ Ga(alpha, beta). A diffuse prior is Ga(alpha, 1), with alpha the prior expectation.
(ii) X ~ N(theta, sigma^2) with sigma^2 known. A diffuse prior for theta is N(mu_0, 1000).

Figure source: HTTP://WWW.AIACCESS.NET/E GM.HTM
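For example (ii), the standard normal-normal conjugate update shows why a large prior variance is "diffuse": the posterior mean essentially reproduces the sample mean. The data and prior settings below are made up for illustration:

```python
sigma = 2.0                               # known sampling s.d.
data = [4.8, 5.3, 5.1, 4.6, 5.2]          # hypothetical observations
n, xbar = len(data), sum(data) / len(data)

def posterior_mean(mu0, tau0_sq):
    # conjugate N(mu0, tau0_sq) prior for theta, N(theta, sigma^2) likelihood:
    # posterior precision is the sum of prior and data precisions,
    # posterior mean is the precision-weighted average of mu0 and xbar
    prec = 1 / tau0_sq + n / sigma**2
    return (mu0 / tau0_sq + n * xbar / sigma**2) / prec

diffuse = posterior_mean(0.0, 1000.0)     # close to xbar: the data dominate
tight = posterior_mean(0.0, 0.1)          # pulled strongly toward mu0 = 0
```

With tau0_sq = 1000 the prior precision 1/1000 is negligible next to the data precision n/sigma^2, which is the precise sense in which a diffuse conjugate prior "lets the data speak."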