Bayesian Inference in GLMs


Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests. Bayesians base inferences on the posterior distribution of the unknowns of interest.

Suppose we have a GLM:

$$\eta_i = g(\mu_i) = x_i'\beta$$

To complete a Bayesian specification of the GLM, we need to choose a prior density for the parameters $(\beta, \phi)$, $\pi(\beta, \phi)$.

The posterior density is then expressed as:

$$\pi(\beta, \phi \mid y) = \frac{f(y; \beta, \phi)\, \pi(\beta, \phi)}{\int f(y; \beta, \phi)\, \pi(\beta, \phi)\, d\beta\, d\phi} = \frac{f(y; \beta, \phi)\, \pi(\beta, \phi)}{\pi(y)},$$

where

$$f(y; \beta, \phi) = \exp\left[ \sum_{i=1}^n \{y_i \theta_i - b(\theta_i)\}/a(\phi) + c(y_i, \phi) \right]$$

and $\pi(y)$ is the marginal likelihood of the data, obtained by integrating the likelihood conditional on the unknown regression coefficients $\beta$ and dispersion parameter $\phi$ across the prior density.

Some Advantages of the Bayesian Approach:

1. Confidence limits and posterior probabilities are more intuitive:

P-value: the probability, under $H_0$, of data at least as extreme as that actually observed (What?).

Posterior probability: the probability of $H_0$ given the current data and outside information.

95% Confidence Interval: in repeated sampling, the interval will contain the true parameter approximately 95% of the time.

95% Credible Interval: ranges from the 2.5th to the 97.5th percentile of the posterior density, so that the true parameter falls within this interval with 95% probability.

2. Provides a natural framework for formalizing the process of learning from the current data to obtain updated beliefs.

3. Flexible in the incorporation of historical data and outside information (e.g., order restrictions, knowledge of a plausible range for a parameter, etc.).

4. Exact posterior distributions can be estimated using Markov chain Monte Carlo (MCMC); this does not rely on asymptotic normality, as MLE-based inferences do.

5. Since MCMC methods are so general, more realistic models can be formulated without as many computational problems.

Normal Linear Model: Suppose $y_i \sim N(x_i'\beta, \phi^{-1})$ and we choose the prior $\pi(\beta, \phi) = \pi(\beta)\pi(\phi)$, with $\pi(\beta) = N(\beta_0, \Sigma_0)$ and $\pi(\phi) = G(a_0, b_0)$. Then the posterior density of $\beta$ is

$$\begin{aligned}
\pi(\beta \mid \phi, y, X) &\propto N(\beta; \beta_0, \Sigma_0) \prod_{i=1}^n N(y_i; x_i'\beta, \phi^{-1}) \\
&\propto \exp\left[-\tfrac{1}{2}\left\{(\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0) + \phi(y - X\beta)'(y - X\beta)\right\}\right] \\
&\propto \exp\left[-\tfrac{1}{2}\left\{\beta'\Sigma_0^{-1}\beta - 2\beta'\Sigma_0^{-1}\beta_0 + \phi\beta'X'X\beta - 2\phi\beta'X'y\right\}\right] \\
&= \exp\left[-\tfrac{1}{2}\left\{\beta'(\Sigma_0^{-1} + \phi X'X)\beta - 2\beta'(\Sigma_0^{-1}\beta_0 + \phi X'y)\right\}\right] \\
&= \exp\left[-\tfrac{1}{2}\left\{\beta'\widehat{\Sigma}_\beta^{-1}\beta - 2\beta'\widehat{\Sigma}_\beta^{-1}\widehat{\beta}\right\}\right] \propto N(\beta; \widehat{\beta}, \widehat{\Sigma}_\beta),
\end{aligned}$$

where $\widehat{\Sigma}_\beta = (\Sigma_0^{-1} + \phi X'X)^{-1}$ is the posterior covariance and $\widehat{\beta} = \widehat{\Sigma}_\beta(\Sigma_0^{-1}\beta_0 + \phi X'y)$ is the posterior mean.
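As a minimal sketch of this result (assuming NumPy; the function and variable names are illustrative, not from the notes), one conditional draw of $\beta$ given $\phi$ can be computed directly from the formulas above:

```python
import numpy as np

def draw_beta(y, X, phi, beta0, Sigma0):
    """Draw beta from its conditional posterior N(beta_hat, Sigma_hat)
    in the normal linear model, given the precision phi."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Posterior precision: Sigma0^{-1} + phi * X'X
    prec = Sigma0_inv + phi * X.T @ X
    Sigma_hat = np.linalg.inv(prec)
    # Posterior mean: Sigma_hat (Sigma0^{-1} beta0 + phi X'y)
    beta_hat = Sigma_hat @ (Sigma0_inv @ beta0 + phi * X.T @ y)
    return np.random.multivariate_normal(beta_hat, Sigma_hat)
```

Alternating this draw with a draw of $\phi$ from its own conditional (derived in the homework below) gives a two-block Gibbs sampler.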

Homework Exercise (Turn in Next Thursday):

1. Derive the posterior density $\pi(\phi \mid \beta, y, X)$.
2. Simulate $y_i \sim N(1 + x_i, 1)$, for $i = 1, \ldots, 100$, with $x_i \sim N(0, 1)$.
3. Choose the priors $(\beta_1, \beta_2) \sim N(0, \mathrm{diag}(10, 10))$ and $\phi \sim G(0.01, 0.01)$.
4. Starting at the prior means, alternately sample from (a) $\pi(\beta \mid \phi, y, X)$ and (b) $\pi(\phi \mid \beta, y, X)$.
5. Plot the iterations for $\beta_1$, $\beta_2$, $\phi$.
6. Comment on convergence and provide posterior summaries of the parameters.

Sampling-Based Approaches to Bayesian Estimation

In most problems, the posterior distribution is not available in closed form, and standard integral approximations can perform poorly. Calculation of the posterior density typically involves high-dimensional integration, with no analytic solution available.

Sampling approach:

1. Construct an algorithm for simulating a long chain of draws from the posterior distribution.
2. Base inferences on posterior summaries of the parameters, or of functionals of the parameters, calculated from the samples (a helper for such summaries is sketched below).
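For step 2, a small illustrative helper (a sketch assuming NumPy; names are hypothetical) that computes the usual posterior summaries from a matrix of retained draws:

```python
import numpy as np

def posterior_summary(draws, names):
    """Posterior mean, sd, and 95% credible interval from MCMC output.
    draws: (n_samples, n_params) array of post burn-in samples."""
    for j, name in enumerate(names):
        mean = draws[:, j].mean()
        sd = draws[:, j].std()
        lo, hi = np.percentile(draws[:, j], [2.5, 97.5])
        print(f"{name}: mean={mean:.3f} sd={sd:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
```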

Markov chain Monte Carlo (MCMC) Algorithms

Gibbs Sampler (see Casella and George, 1992): Repeatedly samples each parameter from its full conditional posterior distribution, given the current values of the other parameters. Under some regularity conditions (Gelfand and Smith, 1990), the samples converge to a stationary distribution that is the joint posterior distribution. Requires an algorithm for sampling from the full conditional distributions.

Metropolis-Hastings Algorithm (see Chib and Greenberg, 1995): Sample a candidate for a parameter from a candidate-generating density (e.g., a normal centered on the previous value of the parameter). Accept the candidate with probability equal to the minimum of one and the ratio of the posterior densities at the new and old values of the parameter, multiplied by a correction for asymmetric candidate-generating densities.

Repeat for all the parameters and for a large number of iterations. The algorithm does not require closed-form full conditionals, but its efficiency can depend strongly on the choice of candidate-generating density, and tuning may be needed.
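A minimal random-walk Metropolis sketch (assuming NumPy; names are illustrative). With a symmetric normal candidate density centered on the current value, the correction for asymmetry cancels, leaving only the posterior ratio:

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter=10000, step=0.1):
    """Random-walk Metropolis: symmetric normal proposal centered on the
    current value, so the Hastings correction cancels."""
    theta = np.asarray(theta0, dtype=float)
    draws = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for t in range(n_iter):
        cand = theta + step * np.random.randn(theta.size)  # candidate draw
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio); work on log scale
        if np.log(np.random.rand()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        draws[t] = theta
    return draws
```

The tuning issue shows up in `step`: too small and the chain moves slowly; too large and almost every candidate is rejected.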

Bayesian Analyses of Binary Response Models

Define a binary regression model as

$$p_i = \Pr(y_i = 1 \mid x_i, \beta) = h(x_i'\beta),$$

where $h(\cdot)$ is a known cdf. Let $\pi(\beta)$ be a prior density for the regression coefficients $\beta$. Then the posterior density of $\beta$ is given by

$$\pi(\beta \mid \text{data}) = \frac{\pi(\beta) \prod_{i=1}^n h(x_i'\beta)^{y_i}\{1 - h(x_i'\beta)\}^{1-y_i}}{\int \pi(\beta) \prod_{i=1}^n h(x_i'\beta)^{y_i}\{1 - h(x_i'\beta)\}^{1-y_i}\, d\beta}.$$

Typically, the integral in the denominator of the above expression is intractable, but one can use an asymptotic approximation. In particular, we can use the approximation

$$\pi(\beta \mid \text{data}) \approx N(\widehat{\beta},\, I(\widehat{\beta})^{-1}),$$

where $\widehat{\beta}$ is the posterior mode and $I(\widehat{\beta})$ is the negative of the second derivative matrix of the log posterior evaluated at the mode.

When the improper uniform prior $\pi(\beta) \propto 1$ is chosen, $\widehat{\beta}$ is the MLE and $I(\widehat{\beta})$ is the observed information matrix. The normal approximation (which, you may note, is typically the basis for frequentist inference in binary response GLMs) is often biased in small samples.
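A sketch of this normal approximation for a logistic model under the flat prior, where the mode is the MLE and the observed information is $X'WX$ with $W = \mathrm{diag}\{p_i(1 - p_i)\}$ (assuming NumPy and SciPy; names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def normal_approx(y, X):
    """Normal approximation to the posterior of beta in a logistic model
    under the improper uniform prior: mode = MLE, curvature = observed
    information."""
    def neg_log_lik(beta):
        eta = X @ beta
        # log-likelihood: sum of y*eta - log(1 + exp(eta))
        return -np.sum(y * eta - np.logaddexp(0.0, eta))
    fit = minimize(neg_log_lik, np.zeros(X.shape[1]), method="BFGS")
    beta_hat = fit.x
    # Observed information: X'WX with W = diag(p(1-p))
    p = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    info = X.T @ (X * (p * (1 - p))[:, None])
    return beta_hat, np.linalg.inv(info)  # approximate posterior mean, covariance
```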

Various Markov chain Monte Carlo (MCMC) approaches have been proposed for posterior computation in binary response GLMs. The most commonly used algorithms are the Gibbs sampler, implemented using adaptive rejection sampling, and data augmentation (for probit models).

The Metropolis-Hastings algorithm can be used in general. It requires the choice of a candidate-generating (i.e., proposal) density, and possibly tuning of this density. The choice of density can greatly affect efficiency and is not necessarily straightforward.

Gibbs Sampling via Adaptive Rejection Sampling (ARS; Gilks and Wild, 1992; Dellaportas and Smith, 1993)

ARS is a general algorithm for sampling from log-concave densities that are not available in closed form. If the likelihood and prior are log-concave with respect to each of the regression parameters, then the ARS algorithm can be used to sample from the full conditional posterior of each parameter given the other parameters and the data.

Hence, in such cases, the ARS algorithm can be used to implement Gibbs sampling. Most commonly used GLMs have the log-concavity property when log-concave priors are chosen (this is the basis of the WinBUGS software). WinBUGS is freely available for download from the web and can be used to easily implement Bayesian analyses for a very wide class of models.

Algorithm:

1. Choose a prior density and initial values for $\beta$.
2. For $j = 1, \ldots, p$, draw a value from the full conditional posterior density $\pi(\beta_j \mid \beta_{(j)}, \text{data})$, where $\beta_{(j)} = \{\beta_k : k \neq j,\ k = 1, \ldots, p\}$. Although this conditional distribution is not available in closed form, except in special cases, we can sample from it using ARS.
3. Repeat step 2 for a large number of iterations. Discard an initial burn-in period to allow convergence to the stationary distribution (which is the joint posterior density under mild regularity conditions), and calculate summaries of the posterior of $\beta$ based on a large number of additional draws.

Rejection Sampling

A general method for sampling from a density $f(x)$, which may be available only in unnormalized form $g(x) = c f(x)$.

1. Define an envelope function $g_u(x)$ such that $g_u(x) \geq g(x)$ for all $x \in D$.
2. (Optionally) define a squeezing function $g_l(x)$ such that $g_l(x) \leq g(x)$ for all $x \in D$.

3. Repeat the following sampling step until $n$ samples have been accepted: draw $x^* \sim g_u$ and $w \sim U(0, 1)$. If you have a squeezing function, then accept $x^*$ if $w \leq g_l(x^*)/g_u(x^*)$. If you don't have a squeezing function, then accept $x^*$ if $w \leq g(x^*)/g_u(x^*)$.

Rejection sampling is useful when it is easy to sample from $g_u$, but not from $f$ directly. Unfortunately, it can be difficult to find suitable envelope and squeezing functions in practice.
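A toy sketch of steps 1-3 without a squeezing function, using the unnormalized Beta(3, 4) kernel $g(x) = x^2(1-x)^3$ and a constant envelope (the target and envelope here are my own illustrative choices, not from the notes):

```python
import numpy as np

def rejection_sample(n):
    """Rejection sampling from the unnormalized density
    g(x) = x^2 (1-x)^3 on (0,1), using a constant envelope
    g_u(x) = M >= max g(x) with uniform proposals."""
    g = lambda x: x**2 * (1 - x)**3
    M = g(0.4)                        # g is maximized at x = 2/5
    samples = []
    while len(samples) < n:
        x = np.random.rand()          # proposal: draw from the (uniform) envelope
        w = np.random.rand()
        if w <= g(x) / M:             # accept with probability g(x)/g_u(x)
            samples.append(x)
    return np.array(samples)
```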

Adaptive Rejection Sampling

ARS reduces the number of evaluations of $g(x)$ by:

1. Assuming log-concavity: for $h(x) = \log g(x)$, the derivative $h'(x) = dh(x)/dx$ decreases with increasing $x \in D$, which avoids the need to identify $\sup\{g(x) : x \in D\}$.
2. Updating the envelope and squeezing functions after each rejection to incorporate the new information about $g(x)$.

Latent Variable Models for Binary Data

Suppose that for a given vector of explanatory variables $x$, the latent variable $U$ has a continuous cumulative distribution function $F(u; x)$, and that the binary response $Y = 1$ is recorded if and only if $U > 0$:

$$\theta = \Pr(Y = 1 \mid x) = 1 - F(0; x).$$

Since $U$ is not directly observed, there is no loss of generality in taking the critical value (i.e., cutoff point) to be 0. In addition, we can take the standard deviation of $U$ (or some other measure of dispersion) to be 1, without loss of generality.

Probit Models

For example, if $U \sim N(x'\beta, 1)$, it follows that

$$\theta_i = \Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta),$$

where $\Phi(\cdot)$ is the cumulative normal distribution function

$$\Phi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp(-\tfrac{1}{2}z^2)\, dz.$$

The relation is linearized by the inverse normal transformation

$$\Phi^{-1}(\theta_i) = x_i'\beta = \sum_{j=1}^p x_{ij}\beta_j.$$

We have regarded the cutoff value of $U$ as fixed and the mean of $U$ as changing with $x$. Alternatively, one could assume that the distribution of $U$ is fixed and allow the critical value to vary with $x$ (e.g., dose). In toxicology studies where dose is the explanatory variable, it makes sense to let $V$ denote the minimum level of dose needed to produce a response (i.e., the tolerance).

Under the second formulation, $y_i = 1$ if $x_i'\beta > v_i$. It follows that

$$\Pr(Y = 1 \mid x_i) = \Pr(V \leq x_i'\beta).$$

Note that the shape of the dose-response curve is determined by the distribution function of $V$. If $V \sim N(0, 1)$, then $\Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, and it follows that the $U$ and $V$ formulations are equivalent. The $U$ formulation is more common.
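The latent-variable formulation underlies the data-augmentation Gibbs sampler for probit models mentioned earlier: alternately impute each $U_i$ from a truncated normal and draw $\beta$ from a normal full conditional. A minimal sketch (in the style of Albert and Chib's data augmentation, assuming a flat prior on $\beta$ for brevity; NumPy and SciPy; names are illustrative):

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, n_iter=2000):
    """Data-augmentation Gibbs for the probit model: alternate between
    drawing latent U_i from truncated normals and beta from its normal
    full conditional (flat prior on beta for simplicity)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # U_i | beta, y_i is N(x_i'beta, 1) truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0; bounds below are standardized.
        a = np.where(y == 1, -mu, -np.inf)   # standardized lower bound
        b = np.where(y == 1, np.inf, -mu)    # standardized upper bound
        u = mu + truncnorm.rvs(a, b, size=n)
        # beta | U is N((X'X)^{-1} X'U, (X'X)^{-1})
        beta_hat = XtX_inv @ (X.T @ u)
        beta = np.random.multivariate_normal(beta_hat, XtX_inv)
        draws[t] = beta
    return draws
```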

Latent Utilities and Choice Models

Suppose that Fred is choosing between two brands of a product (say, Ben & Jerry's or Häagen-Dazs). Fred has a utility for Ben & Jerry's (denoted by $Z_{i1}$) and a utility for Häagen-Dazs (denoted by $Z_{i2}$). Letting the difference in utilities be represented by the normal linear model, we have

$$U_i = Z_{i1} - Z_{i2} = x_i'\beta + \epsilon_i, \quad \epsilon_i \sim N(0, 1).$$

If Fred has a higher utility for Ben & Jerry's, then $Z_{i1} > Z_{i2}$, $U_i > 0$, and Fred will choose Ben & Jerry's ($Y_i = 1$).

This latent utility formulation is again equivalent to a probit model for the binary response. The generalization to a multinomial response is straightforward: introduce $k$ latent utilities instead of 2; the individual's response (i.e., choice) corresponds to the category with the maximum utility. This is referred to as a discrete choice model.

Comments: Certain types of models are preferred in certain applications; the probit is common in bioassay and the social sciences. The ability to calculate posterior densities for any functional makes parameter interpretation less important for the Bayesian approach. The model will ideally be motivated by prior information and fit to the current data. Practically, computational convenience also plays a role, particularly in complex settings.

Logistic Regression

The normal form is only one possibility for the distribution of $U$. Another is the logistic distribution with location $x_i'\beta$ and unit scale. The logistic distribution has cumulative distribution function

$$F(u) = \frac{\exp(u - x_i'\beta)}{1 + \exp(u - x_i'\beta)},$$

so that

$$F(0; x_i) = 1/\{1 + \exp(x_i'\beta)\}.$$

It follows that

$$\Pr(Y = 1 \mid x_i) = \Pr(U > 0 \mid x_i) = 1 - F(0; x_i) = 1/\{1 + \exp(-x_i'\beta)\}.$$

To linearize this relation, we take the logit transformation of both sides:

$$\log\{\theta_i/(1 - \theta_i)\} = x_i'\beta.$$

Some Generalizations of the Logistic Model

Logistic regression assumes a restricted dose-response shape; it is possible to relax this restriction. Aranda-Ordaz (1981) proposed two families of linearizing transformations, which can easily be inverted and which span a range of forms.

The first, which is restricted to symmetric cases (i.e., invariant to interchanging success and failure), is

$$\frac{2}{\nu} \cdot \frac{\theta^\nu - (1 - \theta)^\nu}{\theta^\nu + (1 - \theta)^\nu}.$$

In the limit as $\nu \to 0$ this is the logistic (logit) transformation, and for $\nu = 1$ it is linear. The second family is

$$\log[\{(1 - \theta)^{-\nu} - 1\}/\nu],$$

which reduces to the extreme value model when $\nu = 0$ and the logistic when $\nu = 1$. Both families are illustrated in the sketch below.
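A small sketch of the two families (assuming NumPy; the function names are mine), with numerical checks of the stated limiting cases:

```python
import numpy as np

def ao_symmetric(theta, nu):
    """Aranda-Ordaz symmetric family: (2/nu)(theta^nu - (1-theta)^nu)
    / (theta^nu + (1-theta)^nu); logit as nu -> 0, linear at nu = 1."""
    t, c = theta**nu, (1 - theta)**nu
    return (2.0 / nu) * (t - c) / (t + c)

def ao_asymmetric(theta, nu):
    """Aranda-Ordaz asymmetric family: log[{(1-theta)^(-nu) - 1}/nu];
    extreme value (complementary log-log) as nu -> 0, logit at nu = 1."""
    return np.log(((1 - theta)**(-nu) - 1) / nu)

theta = 0.7
print(ao_symmetric(theta, 1e-8), np.log(theta / (1 - theta)))   # approx logit
print(ao_asymmetric(theta, 1.0), np.log(theta / (1 - theta)))   # logit exactly
```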

When there is doubt about the transformation, a formal approach is to use one or the other of the above transformations and to fit the resulting model for a range of possible values of $\nu$. A profile likelihood for $\nu$ can be obtained by plotting the maximized likelihood against $\nu$ (a frequentist approach); see the sketch below.
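A sketch of the profile-likelihood computation for the second (asymmetric) family, maximizing the Bernoulli log-likelihood over $\beta$ at each grid value of $\nu$ (assuming NumPy and SciPy; the inverse link below is my own algebraic inversion of the transformation, and the names and optimizer choice are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def inv_link(eta, nu):
    """Inverse of the asymmetric Aranda-Ordaz link:
    theta = 1 - (1 + nu * exp(eta))^(-1/nu)."""
    return 1.0 - (1.0 + nu * np.exp(eta))**(-1.0 / nu)

def profile_loglik(y, X, nu_grid):
    """Maximized Bernoulli log-likelihood at each value of nu; plot the
    result against nu_grid to obtain the profile likelihood."""
    profile = []
    for nu in nu_grid:
        def neg_ll(beta):
            theta = np.clip(inv_link(X @ beta, nu), 1e-10, 1 - 1e-10)
            return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
        fit = minimize(neg_ll, np.zeros(X.shape[1]), method="Nelder-Mead")
        profile.append(-fit.fun)  # maximized log-likelihood at this nu
    return np.array(profile)
```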

Potentially, one could choose a standard form, such as the logistic, if the corresponding value of $\nu$ falls within the 95% profile likelihood confidence region. Alternatively, we could choose a prior density for $\nu$ and implement a Bayesian approach.
