Bayesian Inference in GLMs

Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests. Bayesians base inferences on the posterior distribution of the unknowns of interest.

Suppose we have a GLM: $\eta_i = g(\mu_i) = x_i'\beta$.

To complete a Bayesian specification of the GLM, we need to choose a prior density $\pi(\beta, \phi)$ for the parameters $(\beta, \phi)$.

The posterior density is then expressed as

$$\pi(\beta, \phi \mid y) = \frac{f(y; \beta, \phi)\,\pi(\beta, \phi)}{\int f(y; \beta, \phi)\,\pi(\beta, \phi)\, d\beta\, d\phi} = \frac{f(y; \beta, \phi)\,\pi(\beta, \phi)}{\pi(y)},$$

where

$$f(y; \beta, \phi)\,\pi(\beta, \phi) = \exp\left[\sum_{i=1}^n \{y_i \theta_i - b(\theta_i)\}/a(\phi) + c(y_i, \phi)\right] \pi(\beta, \phi),$$

and $\pi(y)$ is the marginal likelihood of the data, obtained by integrating the likelihood, conditional on the unknown regression coefficients $\beta$ and dispersion parameter $\phi$, across the prior density.

Some Advantages of the Bayesian Approach:

1. Confidence limits and posterior probabilities are more intuitive. P-value: the probability, under $H_0$, of data at least as extreme as that actually observed (What?). Posterior probability: the probability of $H_0$ given the current data and outside information.

95% Confidence Interval: in repeated sampling, the interval will contain the true parameter approximately 95% of the time.

95% Credible Interval: ranges from the 2.5th to the 97.5th percentile of the posterior density, so that the true parameter falls within this interval with 95% probability.

2. Provides a natural framework for formalizing the process of learning from the current data to obtain updated beliefs.

3. Flexible in the incorporation of historical data and outside information (e.g., order restrictions, knowledge of a plausible range for a parameter, etc.).

4. Exact posterior distributions can be estimated using Markov chain Monte Carlo (MCMC); this does not rely on asymptotic normality, as MLE-based inference does.

5. Since MCMC methods are so general, more realistic models can be formulated without as many computational problems.

Normal Linear Model: Suppose $y_i \sim N(x_i'\beta, \phi^{-1})$ and we choose the prior $\pi(\beta, \phi) = \pi(\beta)\pi(\phi)$, with $\pi(\beta) = N(\beta_0, \Sigma_0)$ and $\pi(\phi) = G(a_0, b_0)$. Then the full conditional posterior density of $\beta$ is

$$\begin{aligned}
\pi(\beta \mid \phi, y, X) &\propto N(\beta; \beta_0, \Sigma_0) \prod_{i=1}^n N(y_i; x_i'\beta, \phi^{-1}) \\
&\propto \exp\left[-\tfrac{1}{2}\left\{(\beta - \beta_0)'\Sigma_0^{-1}(\beta - \beta_0) + \phi(y - X\beta)'(y - X\beta)\right\}\right] \\
&\propto \exp\left[-\tfrac{1}{2}\left\{\beta'\Sigma_0^{-1}\beta - 2\beta'\Sigma_0^{-1}\beta_0 + \phi\beta'X'X\beta - 2\phi\beta'X'y\right\}\right] \\
&= \exp\left[-\tfrac{1}{2}\left\{\beta'(\Sigma_0^{-1} + \phi X'X)\beta - 2\beta'(\Sigma_0^{-1}\beta_0 + \phi X'y)\right\}\right] \\
&\propto \exp\left[-\tfrac{1}{2}\left\{\beta'\Sigma_\beta^{-1}\beta - 2\beta'\Sigma_\beta^{-1}\hat\beta\right\}\right] \propto N(\beta; \hat\beta, \Sigma_\beta),
\end{aligned}$$

where $\Sigma_\beta = (\Sigma_0^{-1} + \phi X'X)^{-1}$ is the posterior covariance and $\hat\beta = \Sigma_\beta(\Sigma_0^{-1}\beta_0 + \phi X'y)$ is the posterior mean.
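To make this conjugate update concrete, here is a minimal sketch (assuming NumPy; the simulated data, prior values, and variable names are illustrative, not from the lecture) that forms $\Sigma_\beta$ and $\hat\beta$ and draws $\beta$ from $N(\hat\beta, \Sigma_\beta)$ for a fixed $\phi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): n observations, p predictors
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-1.0, 1.0])
y = X @ beta_true + rng.normal(size=n)      # unit error variance, so phi = 1

# Prior: beta ~ N(beta0, Sigma0); phi treated as known here
beta0 = np.zeros(p)
Sigma0 = np.diag([10.0, 10.0])
phi = 1.0

# Conditional posterior N(beta_hat, Sigma_beta), as derived above
Sigma_beta = np.linalg.inv(np.linalg.inv(Sigma0) + phi * X.T @ X)
beta_hat = Sigma_beta @ (np.linalg.inv(Sigma0) @ beta0 + phi * X.T @ y)

beta_draw = rng.multivariate_normal(beta_hat, Sigma_beta)
print(beta_hat, beta_draw)
```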

Homework Exercise (turn in next Thursday):

1. Derive the posterior density $\pi(\phi \mid \beta, y, X)$.
2. Simulate $y_i \sim N(-1 + x_i, 1)$ for $i = 1, \ldots, 100$, with $x_i \sim N(0, 1)$.
3. Choose the priors $(\beta_1, \beta_2)' \sim N(0, \mathrm{diag}(10, 10))$ and $\phi \sim G(0.01, 0.01)$.
4. Starting at the prior means, alternately sample from (a) $\pi(\beta \mid \phi, y, X)$ and (b) $\pi(\phi \mid \beta, y, X)$.
5. Plot iterations 1-1000 for $\beta_1$, $\beta_2$, $\phi$.
6. Comment on convergence and provide posterior summaries of the parameters.

Sampling-Based Approaches to Bayesian Estimation

In most problems, the posterior distribution is not available in closed form, and standard integral approximations can perform poorly. Calculation of the posterior density typically involves high-dimensional integration, with no analytic solution available.

Sampling approach:

1. Construct an algorithm for simulating a long chain of draws from the posterior distribution.
2. Base inferences on posterior summaries of the parameters, or of functionals of the parameters, calculated from the samples.

Markov chain Monte Carlo (MCMC) Algorithms

Gibbs Sampler (see Casella and George, 1992): Repeatedly samples each parameter from its full conditional posterior distribution, given the current values of the other parameters. Under some regularity conditions (Gelfand and Smith, 1990), the samples converge to a stationary distribution that is the joint posterior distribution. Requires an algorithm for sampling from the full conditional distributions.
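As a minimal illustration of the mechanics (a toy example of mine, not one from the lecture), the sketch below Gibbs-samples a zero-mean, unit-variance bivariate normal with correlation $\rho$, whose two full conditionals are the univariate normals $N(\rho y, 1 - \rho^2)$ and $N(\rho x, 1 - \rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                        # correlation of the bivariate normal target
n_iter, burn = 5000, 500

x, y = 0.0, 0.0                  # arbitrary starting values
draws = np.empty((n_iter, 2))
for t in range(n_iter):
    # Full conditional of x given y: N(rho * y, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    # Full conditional of y given x: N(rho * x, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    draws[t] = (x, y)

post = draws[burn:]              # discard burn-in draws
print(post.mean(axis=0))                   # approximately (0, 0)
print(np.corrcoef(post.T)[0, 1])           # approximately 0.8
```

The same alternating structure applies when the target is a joint posterior and the full conditionals are the conditional posteriors of individual parameters or blocks of parameters.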

Metropolis-Hastings Algorithm (see Chib and Greenberg, 1995): Sample a candidate value for a parameter from a candidate-generating density (e.g., a normal centered on the previous value of the parameter). Accept the candidate with probability equal to the minimum of one and the ratio of the posterior density at the new and old values of the parameter, multiplied by a correction for asymmetric candidate-generating densities.

Repeat for all the parameters and for a large number of iterations. The algorithm does not require closed-form full conditionals, but its efficiency can depend strongly on the choice of candidate-generating density, and tuning may be needed.
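A minimal random-walk Metropolis sketch for a generic unnormalized log target (the function and parameter names are mine, and the proposal scale is an illustrative tuning value). Because a normal proposal centered at the current value is symmetric, the Hastings correction cancels:

```python
import numpy as np

def rw_metropolis(log_target, init, n_iter=10000, scale=0.5, seed=0):
    """Random-walk Metropolis with a N(current, scale^2 I) proposal."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(init, dtype=float))
    lp = log_target(theta)
    draws = np.empty((n_iter, theta.size))
    accepted = 0
    for t in range(n_iter):
        proposal = theta + scale * rng.normal(size=theta.size)
        lp_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal)/target(current));
        # the symmetric proposal needs no Hastings correction.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
            accepted += 1
        draws[t] = theta
    return draws, accepted / n_iter

# Example: sample a standard normal target
draws, acc_rate = rw_metropolis(lambda th: -0.5 * np.sum(th**2), init=[0.0])
print(draws.mean(), draws.std(), acc_rate)
```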

Bayesian Analyses of Binary Response Models

Define a binary regression model as $p_i = \Pr(y_i = 1 \mid x_i, \beta) = h(x_i'\beta)$, where $h(\cdot)$ is a known cdf. Let $\pi(\beta)$ be a prior density for the regression coefficients $\beta$. Then the posterior density of $\beta$ is given by

$$\pi(\beta \mid \text{data}) = \frac{\pi(\beta) \prod_{i=1}^n h(x_i'\beta)^{y_i}\{1 - h(x_i'\beta)\}^{1 - y_i}}{\int \pi(\beta) \prod_{i=1}^n h(x_i'\beta)^{y_i}\{1 - h(x_i'\beta)\}^{1 - y_i}\, d\beta}.$$

Typically, the integral in the denominator of the above expression is intractable, but one can use an asymptotic approximation. In particular, we can use the approximation $\pi(\beta \mid \text{data}) \approx N(\hat\beta, I(\hat\beta)^{-1})$, where $\hat\beta$ is the posterior mode and $I(\hat\beta)$ is the negative of the second-derivative matrix of the log posterior evaluated at the mode.

When the improper uniform prior $\pi(\beta) \propto 1$ is chosen, $\hat\beta$ is the MLE and $I(\hat\beta)$ is the observed information matrix. The normal approximation (which, you may note, is typically the basis for frequentist inference in binary response GLMs) is often biased for small samples.
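As a hedged sketch of this normal approximation, the code below fits a probit model to simulated data under a flat prior, so the posterior mode is the MLE and the BFGS inverse-Hessian approximation stands in for $I(\hat\beta)^{-1}$ (the data, prior, and names are illustrative, not from the lecture):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

# Illustrative simulated binary data from a probit model
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, norm.cdf(X @ beta_true))

def neg_log_post(beta):
    # Flat prior, so the negative log posterior equals the negative log likelihood
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_post, x0=np.zeros(2), method="BFGS")
beta_mode = fit.x               # posterior mode (here, the MLE)
cov_approx = fit.hess_inv       # approximation to I(beta_mode)^{-1}

# Approximate 95% posterior intervals from the normal approximation
se = np.sqrt(np.diag(cov_approx))
print(np.column_stack([beta_mode - 1.96 * se, beta_mode + 1.96 * se]))
```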

Various Markov chain Monte Carlo (MCMC) approaches have been proposed for posterior computation in binary response GLMs. The most commonly used algorithms are Gibbs samplers, implemented using adaptive rejection sampling or, for probit models, data augmentation.

The Metropolis-Hastings algorithm can be used in general: it requires the choice of a candidate-generating (i.e., proposal) density and possibly tuning of this density. The choice of density can greatly affect efficiency and is not necessarily straightforward.

Gibbs Sampling via Adaptive Rejection Sampling (ARS; Gilks and Wild, 1992; Dellaportas and Smith, 1993)

ARS is a general algorithm for sampling from log-concave densities that do not have a closed form. If the likelihood and prior are log-concave with respect to each of the regression parameters, then the ARS algorithm can be used to sample from the full conditional posterior of each parameter given the other parameters and the data.

Hence, in such cases, the ARS algorithm can be used to implement Gibbs sampling. Most commonly used GLMs have the log-concavity property when log-concave priors are chosen (this is the basis of the WinBUGS software). WinBUGS is freely available for download from the web and can be used to easily implement Bayesian analyses of a very wide class of models.

Algorithm:

1. Choose a prior density and initial values for $\beta$.
2. For $j = 1, \ldots, p$, draw a value from the full conditional posterior density $\pi(\beta_j \mid \beta_{(j)}, \text{data})$, where $\beta_{(j)} = \{\beta_k : k \neq j,\ k = 1, \ldots, p\}$. Although this conditional distribution is not available in closed form, except in special cases, we can sample from it using ARS.
3. Repeat step 2 for a large number of iterations. Discard an initial burn-in period to allow convergence to the stationary distribution (which is the joint posterior density under mild regularity conditions), and calculate summaries of the posterior of $\beta$ based on a large number of additional draws.

Rejection Sampling

A general method for sampling from a density $f(x)$, given a possibly unnormalized version $g(x) = c f(x)$:

1. Define an envelope function $g_u(x)$ such that $g_u(x) \geq g(x)$ for all $x \in D$.
2. (Optionally) define a squeezing function $g_l(x)$ such that $g_l(x) \leq g(x)$ for all $x \in D$.

3. Repeat the following sampling step until $n$ samples have been accepted: draw $x^*$ from the (normalized) envelope $g_u$ and $w \sim U(0, 1)$. If you have a squeezing function, accept $x^*$ immediately if $w \leq g_l(x^*)/g_u(x^*)$ (avoiding an evaluation of $g$); otherwise, accept $x^*$ if $w \leq g(x^*)/g_u(x^*)$. If you don't have a squeezing function, accept $x^*$ if $w \leq g(x^*)/g_u(x^*)$.

Rejection sampling is useful when it is easy to sample from $g_u$ but not from $f$ directly. Unfortunately, it can be difficult to find suitable envelope and squeezing functions in practice.
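A minimal rejection-sampling sketch without a squeezing function, using a constant envelope over $[0, 1]$ for an unnormalized Beta(2, 5) kernel (the target and constants are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)

# Unnormalized target on [0, 1]: g(x) = x (1 - x)^4, proportional to a Beta(2, 5) density
def g(x):
    return x * (1 - x)**4

# Constant envelope g_u(x) = M >= g(x) on [0, 1]; the maximum of g is at x = 0.2
M = g(0.2)

samples = []
while len(samples) < 5000:
    x_star = rng.uniform()            # draw from the normalized envelope (uniform here)
    w = rng.uniform()
    if w <= g(x_star) / M:            # accept with probability g(x*) / g_u(x*)
        samples.append(x_star)

samples = np.array(samples)
print(samples.mean())                 # approximately 2/7 = 0.286, the Beta(2, 5) mean
```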

Adaptive Rejection Sampling

Reduces the number of evaluations of $g(x)$ by:

1. Assuming log-concavity: for $h(x) = \log g(x)$, $h'(x) = dh(x)/dx$ decreases with increasing $x \in D$. This avoids the need to identify $\sup\{g(x) : x \in D\}$.
2. Updating, after each rejection, the envelope and squeezing functions to incorporate the new information about $g(x)$.

Latent Variable Models for Binary Data

Suppose that, for a given vector of explanatory variables $x$, the latent variable $U$ has a continuous cumulative distribution function $F(u; x)$, and that the binary response $Y = 1$ is recorded if and only if $U > 0$:

$$\theta = \Pr(Y = 1 \mid x) = 1 - F(0; x).$$

Since $U$ is not directly observed, there is no loss of generality in taking the critical value (i.e., the cutoff point) to be 0. In addition, we can take the standard deviation of $U$ (or some other measure of dispersion) to be 1, without loss of generality.

Probit Models

For example, if $U \sim N(x'\beta, 1)$, it follows that $\theta_i = \Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function,

$$\Phi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp(-\tfrac{1}{2} z^2)\, dz.$$

The relation is linearized by the inverse normal transformation $\Phi^{-1}(\theta_i) = x_i'\beta = \sum_{j=1}^p x_{ij}\beta_j$.
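For reference, here is a hedged sketch of the data-augmentation Gibbs sampler for this probit model (in the spirit of Albert and Chib, 1993, the approach behind the earlier mention of data augmentation for probit models): alternate drawing the latent $U_i$ from truncated normals and $\beta$ from its conjugate normal full conditional. The simulated data, prior, and iteration counts are illustrative:

```python
import numpy as np
from scipy.stats import truncnorm, norm

rng = np.random.default_rng(4)

# Illustrative simulated probit data
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, norm.cdf(X @ beta_true))

# Prior beta ~ N(0, 10 I); the latent U_i have unit variance by construction
Sigma0_inv = np.eye(p) / 10.0
Sigma_beta = np.linalg.inv(Sigma0_inv + X.T @ X)

beta = np.zeros(p)
draws = np.empty((2000, p))
for t in range(2000):
    mu = X @ beta
    # U_i ~ N(x_i'beta, 1) truncated to (0, inf) if y_i = 1 and to (-inf, 0] if y_i = 0;
    # truncnorm takes bounds standardized by (loc, scale)
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    U = truncnorm.rvs(lower, upper, loc=mu, scale=1.0)
    # Conjugate normal update for beta given U (a linear model with unit error variance)
    beta_hat = Sigma_beta @ (X.T @ U)
    beta = rng.multivariate_normal(beta_hat, Sigma_beta)
    draws[t] = beta

print(draws[500:].mean(axis=0))        # posterior means, roughly near beta_true
```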

We have regarded the cutoff value of $U$ as fixed and the mean of $U$ as changing with $x$. Alternatively, one could assume that the distribution of $U$ is fixed and allow the critical value to vary with $x$ (e.g., dose). In toxicology studies where dose is the explanatory variable, it makes sense to let $V$ denote the minimum level of dose needed to produce a response (i.e., the tolerance).

Under the second formulation, $y_i = 1$ if $x_i'\beta > v_i$. It follows that $\Pr(Y = 1 \mid x_i) = \Pr(V \leq x_i'\beta)$. Note that the shape of the dose-response curve is determined by the distribution function of $V$. If $V \sim N(0, 1)$, then $\Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, and it follows that the $U$ and $V$ formulations are equivalent. The $U$ formulation is more common.

Latent Utilities & Choice Models

Suppose that Fred is choosing between two brands of a product (say, Ben & Jerry's or Häagen-Dazs). Fred has a utility for Ben & Jerry's (denoted by $Z_{i1}$) and a utility for Häagen-Dazs (denoted by $Z_{i2}$). Letting the difference in utilities be represented by the normal linear model, we have

$$U_i = Z_{i1} - Z_{i2} = x_i'\beta + \epsilon_i, \quad \epsilon_i \sim N(0, 1).$$

If Fred has a higher utility for Ben & Jerry's, then $Z_{i1} > Z_{i2}$, $U_i > 0$, and Fred will choose Ben & Jerry's ($Y_i = 1$).

This latent utility formulation is again equivalent to a probit model for the binary response. The generalization to a multinomial response is straightforward: introduce $k$ latent utilities instead of 2; the individual's response (i.e., choice) corresponds to the category with maximum utility. This is referred to as a discrete choice model.
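A tiny simulation of this discrete-choice mechanism (the coefficients and names are illustrative, not from the lecture): each individual receives one latent utility per category, and the observed choice is the category with the largest utility.

```python
import numpy as np

rng = np.random.default_rng(5)

n, k = 1000, 3                           # individuals and choice categories
x = rng.normal(size=n)                   # a single covariate, for illustration
alpha = np.array([0.0, 0.5, -0.5])       # category-specific intercepts
gamma = np.array([0.0, 1.0, -1.0])       # category-specific slopes

# Latent utility of category j for individual i: Z_ij = alpha_j + gamma_j * x_i + eps_ij
Z = alpha + np.outer(x, gamma) + rng.normal(size=(n, k))
choice = Z.argmax(axis=1)                # observed response = category with maximum utility
print(np.bincount(choice) / n)           # empirical choice shares
```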

Comments: Certain types of models are preferred in certain applications; the probit is common in bioassay and the social sciences. The ability to calculate posterior densities for any functional makes parameter interpretation less important in the Bayesian approach. The model will ideally be motivated by prior information and fit to the current data. Practically, computational convenience also plays a role, particularly in complex settings.

Logistic Regression

The normal form is only one possibility for the distribution of $U$. Another is the logistic distribution with location $x_i'\beta$ and unit scale. The logistic distribution has cumulative distribution function

$$F(u) = \frac{\exp(u - x_i'\beta)}{1 + \exp(u - x_i'\beta)},$$

so that $F(0; x_i) = 1/\{1 + \exp(x_i'\beta)\}$.

It follows that

$$\Pr(Y = 1 \mid x_i) = \Pr(U > 0 \mid x_i) = 1 - F(0; x_i) = \frac{1}{1 + \exp(-x_i'\beta)}.$$

To linearize this relation, we take the logit transformation of both sides: $\log\{\theta_i/(1 - \theta_i)\} = x_i'\beta$.
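Tying the pieces together, here is a hedged sketch of a random-walk Metropolis sampler for this logistic regression posterior under a vague normal prior (the simulated data, prior variance, and tuning values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative simulated logistic-regression data
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def log_post(beta):
    # Bernoulli log likelihood with logit link plus a vague N(0, 100 I) prior
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * np.sum(beta**2) / 100.0

beta = np.zeros(2)
lp = log_post(beta)
draws = np.empty((10000, 2))
for t in range(10000):
    prop = beta + 0.2 * rng.normal(size=2)       # symmetric random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis acceptance step
        beta, lp = prop, lp_prop
    draws[t] = beta

print(draws[2000:].mean(axis=0))                 # posterior means after burn-in
```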

Some Generalizations of the Logistic Model

Logistic regression assumes a restricted dose-response shape; it is possible to relax this restriction. Aranda-Ordaz (1981) proposed two families of linearizing transformations, which can easily be inverted and which span a range of forms.

The first family, which is restricted to symmetric cases (i.e., invariant to interchanging success and failure), is

$$\frac{2}{\nu}\, \frac{\theta^\nu - (1 - \theta)^\nu}{\theta^\nu + (1 - \theta)^\nu}.$$

In the limit as $\nu \to 0$ this is the logistic (logit) transformation, and for $\nu = 1$ it is linear. The second family is

$$\log\left[\frac{(1 - \theta)^{-\nu} - 1}{\nu}\right],$$

which reduces to the extreme value (complementary log-log) model as $\nu \to 0$ and to the logistic when $\nu = 1$.
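To make the two families concrete, here is a small sketch (the function names are mine) that evaluates each transformation and numerically checks the stated limiting cases:

```python
import numpy as np

def ao_symmetric(theta, nu):
    """Aranda-Ordaz symmetric family: -> logit as nu -> 0, linear (4*theta - 2) at nu = 1."""
    if nu == 0:
        return np.log(theta / (1 - theta))
    a, b = theta**nu, (1 - theta)**nu
    return (2.0 / nu) * (a - b) / (a + b)

def ao_asymmetric(theta, nu):
    """Aranda-Ordaz asymmetric family: -> complementary log-log as nu -> 0, logit at nu = 1."""
    if nu == 0:
        return np.log(-np.log(1 - theta))
    return np.log(((1 - theta)**(-nu) - 1) / nu)

theta = np.array([0.1, 0.5, 0.9])
print(ao_symmetric(theta, 1e-8), np.log(theta / (1 - theta)))   # nearly identical
print(ao_asymmetric(theta, 1.0), np.log(theta / (1 - theta)))   # identical
```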

When there is doubt about the transformation, a formal approach is to use one or the other of the above transformations and to fit the resulting model for a range of possible values of $\nu$. A profile likelihood can be obtained for $\nu$ by plotting the maximized likelihood against $\nu$ (a frequentist approach).

Potentially, one could choose a standard form, such as the logistic, if the corresponding value of $\nu$ falls within the 95% profile likelihood confidence region. Alternatively, we could choose a prior density for $\nu$ and implement a Bayesian approach.