BAYESIAN ANALYSIS OF BINARY REGRESSION USING SYMMETRIC AND ASYMMETRIC LINKS


Sankhyā: The Indian Journal of Statistics, 2000, Volume 62, Series B, Pt. 3, pp. 373-387

BAYESIAN ANALYSIS OF BINARY REGRESSION USING SYMMETRIC AND ASYMMETRIC LINKS

By SANJIB BASU, Northern Illinois University, DeKalb, USA

and SAURABH MUKHOPADHYAY, Merck Research Laboratories, Rahway, USA

SUMMARY. Binary response regression is a useful technique for analyzing categorical data. Popular binary models use special link functions such as the logit or the probit link. In this article, the inverse link function $H$ is modeled as a scale mixture of cumulative distribution functions. Two different models for $H$ are proposed: (i) $H$ is a finite normal scale mixture with a Dirichlet distribution prior on the mixing distribution; and (ii) $H$ is a scale mixture of truncated normal distributions with the mixing distribution having a Dirichlet prior. The second model allows symmetric as well as asymmetric links. Bayesian analyses of these models using data augmentation and Gibbs sampling are described. Model diagnostics by cross validation of the conditional predictive distributions are proposed. These analyses are illustrated on the beetle mortality data and the Challenger o-ring distress data.

1. Introduction

Consider the binary regression model

$$P(Y_i = 1) = H(x_i^T \beta), \qquad i = 1, \ldots, N. \eqno(1)$$

Here the binary response $y_i$ is either 0 or 1, $x_i^T = (x_{1i}, \ldots, x_{ki})$ is the set of covariates, $\beta = (\beta_1, \ldots, \beta_k)^T$ is a vector of unknown parameters, and the function $H$ is usually assumed to be known. In the terminology of generalized linear models (McCullagh and Nelder (1989)), $H$ is the inverse link function. For ease of exposition, we refer to $H$ as the link function in this article. The popular probit and logit models are obtained when $H$ is chosen as the standard normal cdf $\Phi$ or the cdf of the standard logistic distribution, respectively. Such particular choices of $H$ are often made for convenience and on an ad hoc basis.

Paper received November 1998; revised November .
AMS (1991) subject classification. 62F15, 62J12.
Key words and phrases. Asymmetric link, binary data, cross-validation, Dirichlet distribution, Gibbs sampling, normal scale mixture, predictive distribution.

Binary regression with a parametric family of link functions (instead of a single fixed choice) has been explored by many; see Prentice (1976), Aranda-Ordaz (1981), Guerrero and Johnson (1982), and Stukel (1988). Their work shows that such extended models can significantly improve fits. Notice that (1) requires the range of $H$ to be $[0, 1]$; usually it is also preferable that $H$ be a nondecreasing smooth function. These requirements match exactly those of a smooth continuous cumulative distribution function (cdf). Recently, there has been strong interest in Bayesian analysis of binary and polychotomous response regression with the class (or a subclass) of cdfs as choices for the function $H$; see Albert and Chib (1993), Chen and Dey (1996) and the references therein.

Let $\mathcal{F}$ be the class of all cdfs on $\mathbb{R}$. The family $\mathcal{F}$ includes cdfs which are often undesirable as choices for $H$, for example, cdfs of discrete distributions. Instead, we consider the subclass of normal scale mixture cdfs

$$\mathcal{F}_N = \Big\{ F_N(\cdot) = \int_{[0,\infty)} \Phi(\cdot/\sigma)\, dG(\sigma) : G \text{ is a cdf on } [0,\infty) \Big\}$$

as possible choices for the function $H$. The class of normal scale mixtures allows a variety of functional forms and varying tail structures (including the normal, all $t$ distributions, the logistic, the double exponential, and the Cauchy), thus presenting us with a wide array of choices for the link $H$. Moreover, a normal scale mixture cdf $F_N$ is continuous, smooth, infinitely differentiable and symmetric ($F_N(\theta) = 1 - F_N(-\theta)$). The cdf $F_N$ also does not have a wiggly structure; indeed, $F_N$ is convex on $(-\infty, 0)$ and concave on $(0, \infty)$. For $H(\cdot) = F_N(\cdot) = \int \Phi(\cdot/\sigma)\, dG(\sigma)$, our binary regression model (1) becomes

$$P(Y_i = 1) = \int \Phi(\{x_i^T \beta\}/\sigma)\, dG(\sigma), \qquad i = 1, \ldots, N, \eqno(2)$$

which includes two unknowns, $\beta$ and the mixing distribution $G$. These two unknowns are related since the interpretation of the regression coefficient $\beta$ depends on the form of the link function $H(\cdot)$ and hence on $G$. In our prior specification, however, we typically use the improper non-subjective prior $\pi_1(\beta) \propto 1$ independent of $G$. This specification reflects our complete uncertainty about the value of $\beta$ irrespective of the form of the link function. For the mixing distribution $G$, we use an independent prior $\pi_2(G)$. The posterior distribution $\pi(\beta, G \mid y)$, which combines the prior $\pi(\beta, G)$ and the sampling model of (2), is analytically intractable. Albert and Chib (1993) described a data-augmented Gibbs sampling methodology for probit and $t$-link models. We extend this algorithm to our case of normal scale mixture links.

In section 2, we consider finite normal scale mixture links. The mixing distribution $G$ is of the form $\sum_{j=1}^{s} p_j \delta_{\{\tau_j\}}$ where $0 \le p_j \le 1$, $\sum_j p_j = 1$, and $\delta_{\{\tau_j\}}$ is the degenerate distribution at $\tau_j$. We assume that the support points $0 < \tau_1 < \ldots < \tau_s < \infty$ are user-specified. The resulting link function is $H(\cdot) = \sum_{j=1}^{s} p_j \Phi(\cdot/\tau_j)$, a finite scale mixture of normal cdfs. We assume a Dirichlet distribution ($DD(\nu)$) prior on $p = (p_1, \ldots, p_s)^T$, i.e., $\pi(G) = \pi(p) = \text{constant} \times \prod_{j=1}^{s} p_j^{\nu_j - 1}$, where the $\nu_j > 0$ are user-specified.
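To make the finite-mixture link concrete, the short sketch below evaluates $H(\eta) = \sum_j p_j \Phi(\eta/\tau_j)$ and the resulting Bernoulli log-likelihood for a candidate $(\beta, p)$. It is only an illustration of the definitions above, assuming numpy/scipy; the data arrays, the support grid tau and the function names are placeholders, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def mixture_link(eta, p, tau):
    """Finite normal scale-mixture link H(eta) = sum_j p_j * Phi(eta / tau_j)."""
    eta = np.atleast_1d(eta)
    cdf_grid = norm.cdf(eta[:, None] / tau[None, :])   # shape (len(eta), s)
    return cdf_grid @ p

def bernoulli_loglik(y, X, beta, p, tau):
    """Log-likelihood of binary y under P(Y_i = 1) = H(x_i' beta)."""
    theta = mixture_link(X @ beta, p, tau)
    theta = np.clip(theta, 1e-12, 1 - 1e-12)            # guard against log(0)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# toy usage with hypothetical data and support grid
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
beta = np.array([0.3, 1.2])
tau = np.array([0.5, 1.0, 2.0, 4.0])                     # user-specified support points
p = np.full(4, 0.25)                                     # mixing probabilities
y = rng.binomial(1, mixture_link(X @ beta, p, tau))
print(bernoulli_loglik(y, X, beta, p, tau))
```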

The normal scale mixture link of (2) always produces a symmetric link. We develop a new family of distributions based on mixtures of truncated normals that naturally contains symmetric and asymmetric distributions. Use of truncated normal mixtures as the link $H$ introduces many methodological and computational challenges; these are discussed in sections 3 and 4. Model checking is an integral part of any statistical analysis. We consider several cross validation model checking criteria and develop easy methods for their calculation in section 5. In section 6, we apply our proposed models to the beetle mortality data and the Challenger data of Dalal, Fowlkes and Hoadley (1989). Conclusions are given in section 7.

2. Finite Normal Scale Mixture Links

We observe $(y_i, x_i)$, $i = 1, \ldots, N$, where $y_i$ is binary and $x_i$ is a set of covariates which are either continuous or categorical. We assume that the $Y_i$'s are independent Bernoulli($\theta_i$) with $\theta_i = \int \Phi(\{x_i^T \beta\}/\sigma)\, dG(\sigma)$ as in (2). We take $G = \sum_{j=1}^{s} p_j \delta_{\{\tau_j\}}$ and put a Dirichlet distribution prior $\pi_2(G) = DD(\nu)$ on $p = (p_1, \ldots, p_s)^T$. We further assume an independent prior $\pi_1(\beta)$ on $\beta$. The posterior distribution for this model is analytically intractable. We use Gibbs sampling, which is an extension of the sampler proposed by Albert and Chib (1993). This sampler introduces two sets of latent variables $Z = (Z_1, \ldots, Z_N)^T$ and $\sigma = (\sigma_1, \ldots, \sigma_N)^T$. The complete model structure along with the distributions of the latent variables is given below:

(a) Given $\beta$ and $\sigma$, the latent variables $Z_1, \ldots, Z_N$ are independent with $Z_i \sim N(x_i^T \beta, \sigma_i^2)$.

(b) Given $Z$, the responses $Y_1, \ldots, Y_N$ are completely determined, with $Y_i = 1$ if $Z_i > 0$ and $Y_i = 0$ otherwise.

(c) Given $G$, the latent variables $\sigma_1, \ldots, \sigma_N$ are i.i.d. $G$.

(d) $G = \sum_{j=1}^{s} p_j \delta_{\{\tau_j\}}$, and $p = (p_1, \ldots, p_s)^T$ has a Dirichlet distribution prior $\pi_2(p) = DD(\nu)$.

(e) $\beta$ is independent of $G$ and has a prior $\pi_1(\beta)$.

From (a) and (b), $P(Y_i = 1 \mid \sigma, \beta, G) = P(Z_i > 0 \mid \sigma, \beta, G) = \Phi(\{x_i^T \beta\}/\sigma_i)$. Integrated over $\sigma_i$ (from (c)), $P(Y_i = 1 \mid \beta, G) = \int \Phi(\{x_i^T \beta\}/\sigma_i)\, dG(\sigma_i)$, thus giving back our model of (2).

To implement the Gibbs sampler, one needs to simulate from the full conditional distribution of each unobserved variable given the observed $y$ and the remaining variables. These distributions are described next.

(i) Given $y, \beta, \sigma$ and $G$: $Z_1, \ldots, Z_N$ are independent with $Z_i$ distributed as $N(x_i^T \beta, \sigma_i^2)$ truncated at left by 0 if $y_i = 1$, and truncated at right by 0 if $y_i = 0$.

In (ii)-(iv) below, we assume that the given $y$ and $z$ satisfy $(y_i - \tfrac{1}{2})\, z_i > 0$, $i = 1, \ldots, N$.

(ii) Given $y, \beta, z$ and $G$: $\sigma_1, \ldots, \sigma_N$ are independent with $\sigma_i \sim \sum_{j=1}^{s} q_{ij} \delta_{\{\tau_j\}}$, where $q_{ij} = \{\frac{p_j}{\tau_j} \phi(\{z_i - x_i^T \beta\}/\tau_j)\} / \{\sum_{k=1}^{s} \frac{p_k}{\tau_k} \phi(\{z_i - x_i^T \beta\}/\tau_k)\}$ and $\phi(\cdot)$ is the $N(0,1)$ density function.

(iii) Notice that $\sigma_i$ belongs to the set $\{\tau_1, \ldots, \tau_s\}$ with probability 1. Given $y, \beta, z, \sigma$, let $k_j$ be the number of $\sigma_i$ which equal $\tau_j$. Then $G \mid y, \beta, z, \sigma = \sum_{j=1}^{s} p_j \delta_{\{\tau_j\}}$, where $p = (p_1, \ldots, p_s)^T$ has a Dirichlet distribution $DD(\nu^*)$ with $\nu^*_j = \nu_j + k_j$, $j = 1, \ldots, s$.

(iv) If we assume the customary diffuse prior $\pi_1(\beta) \propto 1$, then $\beta \mid y, z, \sigma, G \sim N_k(\hat{\beta}, (X^T W X)^{-1})$, where $\hat{\beta} = (X^T W X)^{-1} X^T W z$, $W = \mathrm{diag}(1/\sigma_i^2)$, and $X = [x_1, \ldots, x_N]^T$ is the design matrix (we assume $\mathrm{rank}(X) = k$). This follows immediately from Bayesian linear model theory.

Introduction of the latent variables $Z$ and $\sigma$ substantially simplifies this calculation and enables us to obtain the conditional densities in closed forms. Simulation from each of the distributions in (i)-(iv) is relatively easy.

The above model requires two user inputs in the prior structure at step (d): the support set $\tau = (\tau_1, \ldots, \tau_s)$ for the $\sigma_i$'s and the Dirichlet distribution parameter $\nu = (\nu_1, \ldots, \nu_s)$. In the examples, we often choose equally spaced values for the $\tau_j$'s with some small and some large values, and choose equiprobable $\nu_1 = \ldots = \nu_s$ in the absence of other information. One full cycle of the resulting Gibbs sampler is sketched below.
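The following sketch implements one pass through the full conditionals (i)-(iv), assuming the flat prior $\pi_1(\beta) \propto 1$ and only numpy/scipy; the function name gibbs_cycle and the argument layout are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def gibbs_cycle(y, X, beta, sigma, p, tau, nu, rng):
    """One cycle of the data-augmented Gibbs sampler for the finite-mixture link."""
    N, k = X.shape
    mean = X @ beta

    # (i) Z_i ~ N(x_i'beta, sigma_i^2), truncated left at 0 if y_i = 1, right at 0 if y_i = 0
    a = np.where(y == 1, (0.0 - mean) / sigma, -np.inf)
    b = np.where(y == 1, np.inf, (0.0 - mean) / sigma)
    z = truncnorm.rvs(a, b, loc=mean, scale=sigma, random_state=rng)

    # (ii) sigma_i drawn from {tau_1,...,tau_s} with weights q_ij ∝ (p_j/tau_j) phi((z_i - x_i'beta)/tau_j)
    resid = z - mean
    w = (p / tau) * norm.pdf(resid[:, None] / tau[None, :])
    w /= w.sum(axis=1, keepdims=True)
    idx = np.array([rng.choice(len(tau), p=w_i) for w_i in w])
    sigma = tau[idx]

    # (iii) p | ... ~ Dirichlet(nu_j + k_j), k_j = #{i : sigma_i = tau_j}
    counts = np.bincount(idx, minlength=len(tau))
    p = rng.dirichlet(nu + counts)

    # (iv) beta | ... ~ N_k(beta_hat, (X'WX)^{-1}) with W = diag(1/sigma_i^2)
    W = 1.0 / sigma**2
    XtWX = X.T @ (X * W[:, None])
    cov = np.linalg.inv(XtWX)
    beta_hat = cov @ (X.T @ (W * z))
    beta = rng.multivariate_normal(beta_hat, cov)

    return beta, sigma, p, z
```

Repeated application of gibbs_cycle, after a suitable burn-in, yields approximate draws from the joint posterior of $(\beta, p)$.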

3. Asymmetric Links

Aranda-Ordaz (1981), Stukel (1988), and Agresti (1990) describe data where asymmetric links produce significantly better fits than symmetric links. Normal scale mixtures, however, always produce symmetric links due to the symmetry of the normal distribution. If we want to keep the normal mixture structure, one possible way to generate asymmetric links is to consider both location and scale mixtures of normals. Location-scale mixtures of normals are extremely rich; in fact, they contain all densities on the real line in their weak convergence closure (see Lo (1984)). They thus also include multiple-spiked and multimodal densities. If we use only the data to choose a single link function from this class (for example, through a maximum likelihood procedure), a multiple-spiked density will probably be the best choice. The situation is reminiscent of the density estimation scenario where, without any smoothness restriction, a density with spikes at the data points is the maximum likelihood estimate. One way to avoid this problem is to choose an appropriate prior that makes the choice of these undesirable functions less probable a priori. We take an alternate route where we start from a smaller class of functions not containing the undesirable functions. This class is described next.

We consider normal distributions which are truncated either at left or at right, and then consider their scale mixtures. Define a cdf $F(z, \sigma)$ as follows: (i) for $\sigma > 0$, $F(z, \sigma) = 0$ if $z < 0$ and $= 2\Phi(z/\sigma) - 1$ if $z \ge 0$; (ii) for $\sigma < 0$, $F(z, \sigma) = 2\Phi(z/|\sigma|)$ if $z < 0$ and $= 1$ if $z \ge 0$; and (iii) for $\sigma = 0$, $F(z, \sigma) = 0$ if $z < 0$ and $= 1$ if $z \ge 0$. The corresponding density is $f(z, \sigma) = \frac{2}{|\sigma|}\phi(z/|\sigma|)$ if $\sigma z > 0$ and $= 0$ otherwise. Notice that $f(\cdot, \sigma)$ is simply the density of $N(0, \sigma^2)$ truncated at left by 0 if $\sigma > 0$ and truncated at right by 0 if $\sigma < 0$. We consider scale mixtures of $F(\cdot, \sigma)$ as possible choices for the link function $H$, i.e., $H(z) = \int_{\mathbb{R}} F(z, \sigma)\, dG(\sigma)$, where $G$ is a distribution on the whole real line. The class of normal scale mixtures is a subclass of these distributions: if $G$ is symmetric about 0 then $H$ is a normal scale mixture cdf. On the other hand, if $G$ is asymmetric, so also is $H$. The symmetry relation $\Phi(z/\sigma) = 1 - \Phi(-z/\sigma)$ of normal cdfs translates to the following relation in $F$: $F(z, \sigma) = 1 - F(-z, -\sigma)$.

Some caution is also needed in the introduction of the latent variables $Z$. We replace (a) of section 2 by

(a') Given $\beta$ and $\sigma$, the latent variables $Z_1, \ldots, Z_N$ are independent with $Z_i$ having density $f(z_i - x_i^T\beta, \sigma_i)$.

The rest of the finite mixture model ((b)-(e) of section 2) remains the same, except that the domain of the $\tau_j$'s changes to $-\infty < \tau_1 < \ldots < \tau_s < \infty$. To avoid complications, we further assume that $\tau_j \ne 0$, $j = 1, \ldots, s$. With this structure, $P(Y_i = 1 \mid \sigma, \beta, G) = P(Z_i > 0 \mid \sigma, \beta, G) = 1 - F(-x_i^T\beta, \sigma_i) = F(x_i^T\beta, -\sigma_i)$. Integrated over $\sigma_i$, $P(Y_i = 1 \mid \beta, G) = \int F(x_i^T\beta, -\sigma_i)\, dG(\sigma_i)$, again a scale mixture of the family $F(\cdot, \sigma)$, as we want.

For this asymmetric model, the full conditional distributions of each unobserved variable given the observed $y$ and the remaining variables are given below. Note that $Z_i$ has density $f(z_i - x_i^T\beta, \sigma_i)$, $i = 1, \ldots, N$. This poses $N$ restrictions: $\sigma_i (Z_i - x_i^T\beta) > 0$, $i = 1, \ldots, N$ (see the definition of $f(\cdot, \sigma)$), or, in vector notation, $\sigma \circ (Z - X\beta) > 0$, where $\circ$ denotes the coordinatewise product and the inequality holds in every coordinate.

(1) Given $y, \beta, \sigma$ and $G$: $Z_1, \ldots, Z_N$ are independent with $Z_i$ having density $c_1 \phi(\{z_i - x_i^T\beta\}/|\sigma_i|)$ on the restricted domain $\mathcal{Z}$, where $\mathcal{Z} = \{z_i : \sigma_i(z_i - x_i^T\beta) > 0,\ z_i > 0\}$ if $y_i = 1$, $\mathcal{Z} = \{z_i : \sigma_i(z_i - x_i^T\beta) > 0,\ z_i \le 0\}$ if $y_i = 0$, and $c_1$ is the normalizing constant.

In (2)-(4) below, we assume $(y_i - \tfrac{1}{2})\, Z_i > 0$, $i = 1, \ldots, N$.

(2) Given $y, \beta, Z$ and $G$: $\sigma_1, \ldots, \sigma_N$ are independent with $\sigma_i \sim c_2 \sum_{j=1}^{s} q_{ij} \delta_{\{\tau_j\}}$, where $q_{ij} = \frac{p_j}{|\tau_j|} \phi(\{Z_i - x_i^T\beta\}/|\tau_j|)$ if $\tau_j (Z_i - x_i^T\beta) > 0$ and $= 0$ otherwise, and $c_2$ is the normalizing constant.

(3) The conditional distribution of $G \mid y, \beta, Z, \sigma$ is the same as in (iii) of section 2.

(4) For $\pi_1(\beta) \propto 1$, $\beta \mid y, Z, \sigma, G$ is distributed as $N_k(\hat{\beta}, (X^T W X)^{-1})$ (as in (iv) of section 2) restricted to the support $\{\beta : \sigma \circ (Z - X\beta) > 0\}$.
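For concreteness, the truncated-normal cdf $F(z, \sigma)$ defined at the beginning of this section and the resulting asymmetric mixture link $H(z) = \sum_j p_j F(z, \tau_j)$ can be evaluated directly. The short sketch below only illustrates these definitions, assuming numpy/scipy; the support grid and weights are placeholders.

```python
import numpy as np
from scipy.stats import norm

def F_trunc(z, sigma):
    """cdf F(z, sigma) of N(0, sigma^2) truncated at left (sigma > 0) or right (sigma < 0) by 0."""
    z = np.asarray(z, dtype=float)
    if sigma > 0:                      # support on [0, infinity)
        return np.where(z < 0, 0.0, 2.0 * norm.cdf(z / sigma) - 1.0)
    elif sigma < 0:                    # support on (-infinity, 0]
        return np.where(z < 0, 2.0 * norm.cdf(z / abs(sigma)), 1.0)
    else:                              # degenerate point mass at 0
        return np.where(z < 0, 0.0, 1.0)

def asymmetric_link(z, p, tau):
    """Finite scale mixture H(z) = sum_j p_j F(z, tau_j); asymmetric unless p is symmetric in tau."""
    return sum(p_j * F_trunc(z, t_j) for p_j, t_j in zip(p, tau))

# illustration: placing more mass on negative tau's skews the link
z = np.linspace(-4, 4, 9)
tau = np.array([-2.0, -1.0, 1.0, 2.0])
p = np.array([0.4, 0.3, 0.2, 0.1])
print(asymmetric_link(z, p, tau))
```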

The first problem associated with these conditional distributions relates to the irreducibility of the underlying Markov chain. Suppose we start the Gibbs sampler from an initial positive value for $\sigma_i$, i.e., $\sigma_i^{(0)} > 0$. It is then easy to see that at every iteration of the Gibbs cycle the generated $\sigma_i^{(r)}$ will be $> 0$. Similarly, if the initial $\sigma_i^{(0)} < 0$, then every generated $\sigma_i^{(r)} < 0$. This implies that the Markov chain generated by the conditional distributions (1)-(4) is not irreducible. We circumvent this problem by generating from the joint distribution of $(Z, \sigma) \mid y, \beta, G$ (instead of their individual full conditional distributions). The generation from this joint distribution can be done in the following steps: (i) given $y, \beta, G$, the pairs $(Z_i, \sigma_i)$, $i = 1, \ldots, N$, are independent; (ii) generate $\sigma_i$ from the distribution of $\sigma_i \mid y, \beta, G$; and then (iii) generate $Z_i$ from the conditional distribution of $Z_i \mid \sigma_i, y, \beta, G$. The last distribution is already obtained in (1). The distribution of $\sigma_i \mid y, \beta, G$ required in (ii) is very similar to the distribution in (2) above. We have $\sigma_i \sim c_2 \sum_{j=1}^{s} q^*_{ij} \delta_{\{\tau_j\}}$, where the $q^*_{ij}$'s are now determined by the following rules. Suppose $y_i = 1$. Then $q^*_{ij} = p_j\, \Phi(\min(x_i^T\beta, 0)/|\tau_j|)$ if $\tau_j < 0$. For $\tau_j > 0$, $q^*_{ij} = 0$ if $x_i^T\beta < 0$ and $q^*_{ij} = p_j\, (\Phi(x_i^T\beta/\tau_j) - 1/2)$ if $x_i^T\beta > 0$. The case of $y_i = 0$ is similar.

The other problem relates to the generation of $\beta$ from its full conditional distribution given in (4). We need to generate random deviates from a multivariate normal on the restricted domain $\{\beta : c_i\, x_i^T\beta > b_i,\ i = 1, \ldots, N\}$, where $c_i = -1$, $b_i = -Z_i$ if $\sigma_i > 0$ and $c_i = 1$, $b_i = Z_i$ if $\sigma_i < 0$. Thus, we need to generate $\beta$ from a $k$-dimensional space, but the support of $\beta$ is restricted by $N$ ($N \ge k$) linear constraints. Moreover, the support region changes in every Gibbs iteration. Our first attempted solution to this problem was simply to generate $\beta$ from the $k$-dimensional multivariate normal $N_k(\hat{\beta}, (X^T W X)^{-1})$ and then accept or reject depending on whether the generated $\beta$ falls within the support or outside it. However, we soon found that this proposal is extremely inefficient. When $N$ is moderately large (for example, $N = 481$ and $k = 2$ as in the beetle mortality example of section 6), the $k$-dimensional support of $\beta$ restricted by the $N$ hyperplanes may be a very crooked region of small volume and may carry a very small percentage of the total mass of the $N_k(\hat{\beta}, (X^T W X)^{-1})$ distribution.

4. Simulation Techniques

In this section we discuss two problems of random variate generation: (i) simulation from a univariate truncated normal distribution and (ii) simulation from a multivariate normal over a restricted support. The first problem has an easy solution by the inverse cdf method, unless the support interval is in the far tail of the distribution, where the method may run into numerical problems. Many other efficient methods are available for generation from distribution tails; see Schmeiser (1980) and Dagpunar (1988). Since the normal distribution is log-concave, one may also use the adaptive rejection method of Gilks and Wild (1992). We follow the envelope rejection method where the truncated normal density is dominated by a truncated exponential density $\lambda \exp(-\lambda(x - a))\, I_{\{x \ge a\}}$ and the parameter $\lambda$ is chosen optimally. See Dagpunar (1988, p. 185) for more details. This tail-sampling step is sketched below.
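The sketch below implements this envelope rejection step for a standard normal truncated to $[a, \infty)$; the rate $\lambda = (a + \sqrt{a^2 + 4})/2$ is the usual optimal choice for this exponential envelope, and the general $N(\mu, \sigma^2)$ case reduces to it by standardizing. The function names and the fallback threshold are illustrative.

```python
import numpy as np

def std_normal_left_tail(a, rng):
    """Sample X ~ N(0,1) conditioned on X >= a via a translated-exponential envelope."""
    lam = 0.5 * (a + np.sqrt(a * a + 4.0))     # optimal exponential rate for this envelope
    while True:
        x = a + rng.exponential(1.0 / lam)      # proposal from lambda * exp(-lambda (x - a)), x >= a
        if rng.random() <= np.exp(-0.5 * (x - lam) ** 2):
            return x                            # accept with probability exp(-(x - lam)^2 / 2)

def truncated_normal(mu, sigma, a, rng):
    """Sample from N(mu, sigma^2) truncated to [a, infinity)."""
    a_std = (a - mu) / sigma
    if a_std < 0.5:
        # truncation point not in the far tail: plain rejection from N(0,1) is cheap enough
        while True:
            x = rng.standard_normal()
            if x >= a_std:
                return mu + sigma * x
    return mu + sigma * std_normal_left_tail(a_std, rng)

rng = np.random.default_rng(1)
draws = np.array([truncated_normal(0.0, 1.0, 3.5, rng) for _ in range(5)])
print(draws)   # all draws lie above 3.5
```

Right-truncation to $(-\infty, b]$ follows by symmetry: sample from the left-truncated normal at $-b$ and negate the draw.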

The second simulation problem arises from the asymmetric link model, where we need to generate a random $\beta$ from the $k$-dimensional distribution $N_k(\hat{\beta}, (X^T W X)^{-1})$ on the restricted support $\{\beta : \sigma \circ (Z - X\beta) > 0\}$. In the following, we discuss the general problem of generating a random deviate $X$ from a multivariate density $f$ restricted to a support set $A \subset \mathbb{R}^k$. Let $S$ be the set $\{(x, u) : x \in A,\ 0 \le u \le f(x)\} \subset \mathbb{R}^{k+1}$, i.e., the region under the graph of $f$ over $A$. Our problem is then equivalent to generating a uniformly distributed random point $Y = (X, U)$ on $S$, since $X$ (the projection of $Y$ onto $A$) is then a random deviate with density $f$ on $A$ (see Devroye (1986, Theorem 3.1, p. 40)). We use the Markov chain Monte Carlo method proposed by Smith (1984) to generate uniformly distributed points over a bounded region $S$ (also see Rubin (1984) for other approaches). Smith's mixing algorithm is as follows:

(1) Start with an initial point $Y_0 \in S$ and set $i = 0$.

(2) Generate a random direction $d$ uniformly distributed over a direction set $D \subset \mathbb{R}^{k+1}$. Find the line set $L = S \cap \{y : y = Y_i + \lambda d,\ \lambda \text{ a real scalar}\}$. Generate a new point $Y_{i+1}$ uniformly distributed over $L$.

(3) If $i <$ the prespecified maximum iteration number, set $i = i + 1$ and go back to (2).

Smith shows that (under some assumptions) this Markovian scheme generates points asymptotically uniformly distributed over $S$. One choice for the direction set $D$ is the set of $(k+1)$ coordinate directions. Moreover, if the restricted region is of the form $\{x : Ax \le b\}$, then the choice $D = \{\text{coordinate directions}\}$ especially simplifies the determination of the line segment $L$ at every iteration. Notice that the restricted region we consider in the asymmetric link model is exactly of this form. However, if the region under consideration has the shape of an elongated polygonal tube of small cross-section at an angle to the coordinate axes, then the coordinate-direction algorithm will take many small steps and the rate of convergence will be painfully slow. Our experience with the examples suggests that this can happen in the asymmetric link model. Instead of the coordinate directions, we therefore choose the alternative random-directions algorithm, where at each iteration a random direction is chosen from the direction set $D$ = the $(k+1)$-dimensional unit sphere $= \{d : \|d\| = 1\}$. The determination of the line segment $L$ is slightly more complicated in the random-directions algorithm, but the convergence rate is faster. The basic mechanics of the random-directions step are sketched below.
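The following sketch shows the random-directions (hit-and-run) mechanics for the simpler problem of sampling uniformly from a polytope $\{x : Ax \le b\}$; the method described above applies the same direction/line-segment step in the $(k+1)$-dimensional region under the graph of the restricted normal density. It is a minimal illustration, assuming numpy only, with a toy polytope as the usage example.

```python
import numpy as np

def hit_and_run(A, b, x0, n_steps, rng):
    """Random-directions walk targeting the uniform distribution on {x : A x <= b}."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)                  # uniform direction on the unit sphere
        # feasible step sizes lambda satisfy A(x + lambda d) <= b, i.e. lambda * (A d) <= b - A x
        Ad = A @ d
        slack = b - A @ x
        upper = np.min(slack[Ad > 1e-12] / Ad[Ad > 1e-12], initial=np.inf)
        lower = np.max(slack[Ad < -1e-12] / Ad[Ad < -1e-12], initial=-np.inf)
        lam = rng.uniform(lower, upper)         # uniform point on the line segment L
        x = x + lam * d
        samples.append(x.copy())
    return np.array(samples)

# toy usage: uniform samples on the square [-1, 1]^2
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
rng = np.random.default_rng(2)
print(hit_and_run(A, b, np.zeros(2), 5, rng))
```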

5. Model Diagnostics

Model selection and model diagnostics are integral parts of any data analysis. The formal Bayesian criterion for comparison of two models is the Bayes factor. The computation of Bayes factors from a Markov chain Monte Carlo analysis, however, is typically difficult since the simulation methods avoid the computation of the normalizing constant, and this is precisely what is needed in the Bayes factor (see DiCiccio et al. (1997) for a review of various Bayes factor estimation methods). We avoid these complications of Bayes factor computation and instead use the cross-validated predictive criteria which have been proposed and used by Geisser and Eddy (1979), Gelfand, Dey, and Chang (1992), Gelfand (1996), and Gelfand and Ghosh (1998), among others.

In many applications, we observe multiple independent binary responses under the same covariate vector $x_i$. Let $L$ be the number of distinct $x_i$'s, and denote them by $x_1, \ldots, x_L$. Let $n_k$ be the total number of binary $Y$'s observed under $x_k$ (so $\sum_{k=1}^{L} n_k = N$), out of which $T_k = \sum_{i : x_i = x_k} Y_i$ are 1's. According to our sampling model, $T_1, \ldots, T_L$ are independent and $T_k \sim \mathrm{Binomial}(n_k, \theta_k)$, where $\theta_k = H(x_k^T\beta)$. For model checking, we cross-validate the sufficient statistics $T_1, \ldots, T_L$ (instead of the $Y$'s). Let $t_k$ be the observed value of $T_k$, let $t$ be the $L \times 1$ observed data vector, and let $t_{(k)}$ denote the $(L-1) \times 1$ vector with the $k$-th observation $t_k$ deleted. Also, let $\omega = (\beta, G, Z, \sigma)$ denote the set of unobserved variables. We use $f$ to denote predictive distributions (e.g., $f(t_k \mid t_{(k)})$) as well as sampling distributions ($f(t_k \mid \omega)$), and $\pi$ to denote priors ($\pi(\omega)$) as well as posteriors ($\pi(\omega \mid t)$). We assume all relevant integrals in the following exist.

We check models from a cross-validated predictive approach and examine $f(t_k \mid t_{(k)})$, i.e., the predictive distribution of the random variable $T_k$ conditioned on the remaining observations $t_{(k)}$. Following Gelfand, Dey and Chang (1992), we compare a random $T_k$ from $f(t_k \mid t_{(k)})$ against the observed value $t_k$ by the following two checking criteria. See Gelfand, Dey and Chang (1992) for other checking criteria and more details.

(a) $d_{1k}$ = the expected difference between the observed $t_k$ and the random $T_k$, i.e., $d_{1k} = t_k - \mu_k$, where $\mu_k$ is the mean of the distribution $f(t_k \mid t_{(k)})$. The $d_{1k}$ are thus the familiar residuals. We can also compute the studentized residuals $d^*_{1k} = d_{1k}/s_k$, where $s_k^2 = \mathrm{Var}[T_k \mid t_{(k)}]$. We use the quantity $Q_1 = \sum_k (d^*_{1k})^2$ as a summary model diagnostic index.

(b) $d_{2k} = f(t_k \mid t_{(k)})$, i.e., the likelihood of observing $T_k = t_k$ given the remaining observations $t_{(k)}$. Small values of $d_{2k}$ criticize the model. As suggested by Geisser and Eddy (1979) and Gelfand, Dey and Chang (1992), we use $Q_2 = \prod_{k=1}^{L} d_{2k}$ as the second summary index of model diagnostics. Notice that $Q_2$ can be interpreted as a joint pseudo marginal likelihood of the observed $t$.

To compute $d_{1k}$ and $d_{2k}$, we need the mean $\mu_k$, the variance $s_k^2$, and the value $f(t_k \mid t_{(k)})$ of the predictive distribution. Notice that $f(t_k \mid t_{(k)}) = \int f(t_k \mid \omega)\, \pi(\omega \mid t_{(k)})\, d\omega$. One possible strategy to approximate $f(t_k \mid t_{(k)})$ is as follows: (i) delete $t_k$ from the observed data vector $t$ to obtain $t_{(k)}$; (ii) use Gibbs sampling (sections 2 and 3) to generate $R$ Monte Carlo samples $\omega_r$ from $\pi(\omega \mid t_{(k)})$; and (iii) approximate $f(t_k \mid t_{(k)})$ by the Monte Carlo sum $\frac{1}{R}\sum_{r=1}^{R} f(t_k \mid \omega_r)$. Since we need $f(t_k \mid t_{(k)})$ for every $k = 1, \ldots, L$, this strategy would require $L$ separate Gibbs sampling runs. The following alternative strategy works faster; it estimates $f(t_k \mid t_{(k)})$ for every $k = 1, \ldots, L$ from a single Gibbs run. Notice that if we can generate $\omega$ samples from $\pi(\omega \mid t_{(k)})$, we can then approximate $f(t_k \mid t_{(k)})$ by (iii) above. But

$$\pi(\omega \mid t_{(k)}) = \frac{f(t_{(k)} \mid \omega)\, \pi(\omega)}{m(t_{(k)})} = \frac{m(t)}{m(t_{(k)})} \cdot \frac{\pi(\omega \mid t)}{f(t_k \mid \omega)} = \frac{c(t, t_{(k)})}{f(t_k \mid \omega)}\, \pi(\omega \mid t),$$

where $m(t) = \int f(t \mid \omega)\, \pi(\omega)\, d\omega$ is the marginal likelihood and $c(t, t_{(k)}) = m(t)/m(t_{(k)})$ is a constant.

If we now generate $R$ Monte Carlo $\omega$ samples from the complete posterior distribution $\pi(\omega \mid t)$ in one Gibbs sampling run, then $f(t_k \mid t_{(k)})$ can be estimated by $\{c(t, t_{(k)})/R\}\sum_{r=1}^{R} f(t_k \mid \omega_r)/f(t_k \mid \omega_r)$ for every $k = 1, \ldots, L$. The constant $c(t, t_{(k)})$ is not known, but notice that $1 = \int \pi(\omega \mid t_{(k)})\, d\omega = c(t, t_{(k)}) \int \{1/f(t_k \mid \omega)\}\, \pi(\omega \mid t)\, d\omega$. Hence, $c(t, t_{(k)})$ can also be estimated from the same Gibbs run by $\{R^{-1}\sum_{r=1}^{R} 1/f(t_k \mid \omega_r)\}^{-1}$.

To obtain $d_{1k}$, we need $\mu_k = E[T_k \mid t_{(k)}]$ and $s_k^2 = \mathrm{Var}[T_k \mid t_{(k)}]$, which suggests that we may need to estimate the predictive density $f(t_k \mid t_{(k)})$ for a whole range of values of $T_k$. However, this again can be simplified. Notice that $\mu_k = E[T_k \mid t_{(k)}] = E[\,E[T_k \mid t_{(k)}, \omega] \mid t_{(k)}\,] = E[\,E[T_k \mid \omega] \mid t_{(k)}\,]$, since $T_1, \ldots, T_L$ are conditionally independent given $\omega$. In our setup, $T_k \mid \omega \sim \mathrm{Binomial}(n_k, \theta_k)$ with $\theta_k = H(x_k^T\beta)$, so the inner expectation is $E[T_k \mid \omega] = n_k\theta_k$ and $\mu_k = \int n_k\theta_k\, \pi(\omega \mid t_{(k)})\, d\omega$. Similarly,

$$s_k^2 = \mathrm{Var}[T_k \mid t_{(k)}] = \mathrm{Var}[\,E[T_k \mid \omega] \mid t_{(k)}\,] + E[\,\mathrm{Var}[T_k \mid \omega] \mid t_{(k)}\,] = \mathrm{Var}[n_k\theta_k \mid t_{(k)}] + E[n_k\theta_k(1 - \theta_k) \mid t_{(k)}] = -\mu_k^2 + \int \{n_k^2\theta_k^2 + n_k\theta_k(1 - \theta_k)\}\, \pi(\omega \mid t_{(k)})\, d\omega.$$

$\mu_k$ and $s_k^2$ can now be easily estimated by the corresponding Monte Carlo sums taken over samples generated from the posterior distribution, weighting each draw by $c(t, t_{(k)})/f(t_k \mid \omega_r)$ as above. These single-run computations are sketched below.
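The following sketch computes $d_{1k}$, $d^*_{1k}$, $d_{2k}$, $Q_1$ and $Q_2$ from a single set of posterior draws of the fitted probabilities $\theta_k$, using the harmonic-mean estimate of $c(t, t_{(k)})$ derived above. It assumes numpy/scipy; the input array theta_draws ($R$ draws by $L$ cells) is a placeholder for whatever sampler produced the posterior.

```python
import numpy as np
from scipy.stats import binom

def cross_validated_diagnostics(t, n, theta_draws):
    """Single-run estimates of the cross-validated residuals and CPOs.

    t           : (L,) observed success counts
    n           : (L,) binomial sample sizes
    theta_draws : (R, L) posterior draws of theta_k = H(x_k' beta)
    """
    lik = binom.pmf(t, n, theta_draws)                 # f(t_k | omega_r), shape (R, L)
    lik = np.clip(lik, 1e-300, None)                   # guard against exact zeros

    # d_2k = f(t_k | t_(k)) ~ harmonic mean of f(t_k | omega_r)  (= estimate of c(t, t_(k)))
    d2 = 1.0 / np.mean(1.0 / lik, axis=0)

    # weights c(t, t_(k)) / f(t_k | omega_r) turn posterior draws into
    # draws from pi(omega | t_(k)) when estimating mu_k and s_k^2
    w = d2 / lik
    mu = np.mean(w * n * theta_draws, axis=0)
    second = np.mean(w * (n**2 * theta_draws**2 + n * theta_draws * (1 - theta_draws)), axis=0)
    s2 = second - mu**2

    d1 = t - mu
    d1_star = d1 / np.sqrt(s2)
    Q1 = np.sum(d1_star**2)
    Q2 = np.prod(d2)
    return d1, d1_star, d2, Q1, Q2
```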

6. Application

We illustrate the use of our binary response models and the calculation of our model diagnostic tools in two examples. The first example studies the well-known beetle mortality data; we compare analyses based on the finite mixture model and the asymmetric link model. In the second example, we analyze the Challenger o-ring distress data; we compare the performances of our proposed finite mixture and asymmetric link models with Bayesian probit and $t$-links. In addition, we also study the performance of the general normal mixture link model proposed in Basu and Mukhopadhyay (1998).

Example 1. Bliss (1935) reports the results of a toxicological experiment concerning the number of beetles killed after 5 hours of exposure to gaseous carbon disulphide at various concentrations. Figure 1 shows the observed proportion of beetles killed against log dosage of carbon disulphide. The plot clearly shows a non-symmetric structure. Aranda-Ordaz (1981), Stukel (1988), Agresti (1990) and many others have examined these data from a non-Bayesian viewpoint. These earlier analyses found that asymmetric link models typically lead to significant improvement in the maximum likelihood based model fit.

Figure 1. Observed proportions of beetles killed and posterior expected mortality probabilities from the MF and MA models.

We analyze these data using our proposed finite mixture link model MF and the asymmetric link model MA. We use log dosage as the single covariate; thus, our postulated link structure is $P(Y = 1) = H(\beta_0 + \beta_1\, \text{log-dosage})$. In the finite mixture model MF, the function $H$ is a finite scale mixture of normal cdfs, $H(\cdot) = \sum_{j=1}^{s} p_j \Phi(\cdot/\tau_j)$, where the mixing probabilities $p = (p_1, \ldots, p_s)$ follow a Dirichlet distribution $DD(\nu)$. We use $s = 11$ and the set of $\tau_j$ values $T = \{0.5, 1, 2, 3, 4, 6, 8, 10, 15, 20, 30\}$. This choice covers the range of small, moderate as well as large $\tau$ values. We use equal values for the Dirichlet distribution parameters, $\nu_1 = \ldots = \nu_{11} = \alpha/11$, where $\alpha > 0$ is a user-specified constant. For the asymmetric link model MA, we use a similar specification of the prior parameters. Here the support points $\tau_j$ are chosen to be $T \cup \{-T\}$, where $T$ is the same set as in model MF, and the Dirichlet distribution parameters are specified to be $\nu_1 = \ldots = \nu_{22} = \alpha/22$.
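In code, this prior specification amounts to nothing more than fixing the support grid and a symmetric Dirichlet parameter; the value of alpha below is a placeholder for the user-specified constant.

```python
import numpy as np

# support grid T from the text, and its reflection for the asymmetric model
T = np.array([0.5, 1, 2, 3, 4, 6, 8, 10, 15, 20, 30])
alpha = 1.0                               # user-specified total Dirichlet mass (placeholder value)

tau_MF = T                                # MF: 11 positive support points
nu_MF = np.full(T.size, alpha / T.size)

tau_MA = np.concatenate([-T[::-1], T])    # MA: 22 support points, negative and positive
nu_MA = np.full(tau_MA.size, alpha / tau_MA.size)
```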

In Figure 1, the posterior expected probabilities of beetle mortality obtained from models MF and MA are plotted along with the observed proportions. The asymmetric link model MA easily adapts to the non-symmetric shape displayed by the observed proportions and shows a better fit compared to the symmetric finite mixture model MF. Let $n_k$ and $t_k$ denote the number of beetles exposed and the number of beetles killed at a particular log-dosage level. We use the sum of squared differences between the observed and expected counts, i.e., $SSE = \sum_{k=1}^{L} [\,t_k - n_k\, E\{H(\beta_0 + \beta_1\, \text{log-dosage}_k) \mid t\}\,]^2$, as a summary index of model fit. We further calculate the summary diagnostic indices $Q_1 = \sum_k (d^*_{1k})^2$ and $Q_2 = \prod_k d_{2k}$ as proposed in section 5. These values are shown in Table 1. The SSE for model MA is almost half of the SSE value for model MF, indicating a significantly better fit from the former model. A comparison of the diagnostic index $Q_2$ values for the two models provides further support for this statement: the value of $Q_2$ for the MA model is almost 38 times higher than for the MF model. The other index, $Q_1$, however, is slightly higher for the asymmetric model MA. This is mostly due to the high influence ($d^*_{1k} = 5.28$) of the 6th observation (see the plots of $d^*_{1k}$ and $d_{2k}$ in Figure 2). Overall, the asymmetric link model MA fits these data significantly better than the finite mixture model MF.

Table 1. Beetle mortality data: SSE and summary model diagnostic measures

                                  Finite mixture model MF    Asymmetric model MA
SSE
$Q_1 = \sum_k (d^*_{1k})^2$
$Q_2 = \prod_k d_{2k}$

Example 2. An interesting application of binary regression to the risk analysis of the Challenger space shuttle is given in Dalal, Fowlkes and Hoadley (1989). The Rogers commission concluded that the Challenger accident was caused by a gas leak through the 6 o-ring joints of the shuttle. Dalal, Fowlkes and Hoadley (1989) looked at the number of distressed o-rings (among the 6) versus launch temperature (Temp.) and pressure (Pres.) for 23 previous shuttle flights. The previous shuttles were launched at temperatures between 53°F and 81°F. A maximum likelihood logistic regression analysis with temperature and pressure as covariates yields an insignificant effect of pressure and predicts a strong probability (about 82%) of distress in the o-rings at Temp. = 31 and Pres. = 200, the actual launch conditions of the fatal shuttle. However, Lavine (1991) later pointed out that such a prediction depends strongly on the choice of the link function. A probit or a complementary log-log link model fits the data equally well, but predicts a smaller probability of distress (about 67%) at the launch conditions.

In our Bayesian analysis of the Challenger data, we include both temperature and pressure as covariates. We examine the following models for the link function $H$: MP, a probit link, $H = \Phi$; MT, a $t$ link, $H$ = the cdf of a $t$ distribution; MF, a finite mixture model; and MA, an asymmetric link model.

Figure 2. Beetle mortality data: Model diagnostics for the MF and MA models.

In addition, we examine the general mixture model MG proposed in Basu and Mukhopadhyay (1998). In this model, the link function, $H(\cdot) = \int \Phi(\cdot/\sigma)\, dG(\sigma)$, is still a scale mixture of normals. However, the mixing distribution $G$ is not restricted to be supported on finitely many points; rather, it can be an arbitrary distribution. A Dirichlet process prior is assumed on the mixing distribution $G$. We refer the reader to Basu and Mukhopadhyay (1998) for further details about this general mixture model.

The performances of these five models on the Challenger data are shown in Table 2 and Figures 3 and 4. In terms of SSE, the general mixture model MG performs the best, though the SSE values for the other models (except MA) are comparable. The asymmetric link model MA has the worst (largest) SSE value, but its summary model diagnostic measure $Q_1 = \sum_k (d^*_{1k})^2$ is the best (smallest). The plots of $d^*_{1k}$ and $d_{2k}$ in Figure 4 point out the 5th and the 14th observations as influential in models MF and MG. Both Dalal, Fowlkes and Hoadley (1989) and Lavine (1991) also found observation 14 troublesome. However, notice that neither of these observations is influential in the MA model; in fact, all the $d^*_{1k}$ and $d_{2k}$ values are within reasonable range in the MA model. The MA model thus adapts itself to guard against the influential observations, but loses in terms of SSE (the model fitting criterion) in the process. This dichotomy between model diagnostics and model fitting can also be seen in the plot of fitted probabilities in Figure 3.

Table 2. Challenger data: SSE, summary model diagnostics and predicted failure probability, P(31, 200)

Model    SSE    $Q_1 = \sum_k (d^*_{1k})^2$    $Q_2 = \prod_k d_{2k}$    P(31, 200)
MP
MT
MF
MG
MA

Figure 3. Observed proportions of o-ring distress and posterior expected/predicted distress probabilities from the five models: MP, MT, MF, MG and MA.

Figure 4. Challenger data: Model diagnostics for the MF, MG and MA models.

One of the principal aims of the analysis of the Challenger data is to predict the probability of a failure, P(31, 200), at Temp. = 31 and Pres. = 200. For our five Bayesian models, the P(31, 200) values listed in Table 2 show a wide range. This is clearly expected and is in agreement with Lavine's (1991) findings. As seen in Figure 4, Temp. = 31 is far beyond the range of the observed data. Prediction at such an extrapolated point is expected to depend strongly on the choice of the link function and on the other parameters of the model.

7. Discussion

In this article, we have proposed Bayesian analyses of binary response regression using scale mixtures of cdfs as the link function $H$. We have presented two different link structures: finite normal scale mixtures (MF) and scale mixtures of truncated normals (MA). These models introduce flexibility in the choice of $H$ and free the user from a single pre-specified functional form. Moreover, the model MA introduces additional flexibility by allowing asymmetry in $H$.

Basu and Mukhopadhyay (1998) generalized the finite mixture model to a general mixture model in which the mixing distribution $G$ is not restricted to be supported on finitely many points; they use a Dirichlet process prior on $G$. We note here that their proposed methodology can easily be implemented for our asymmetric link model of section 3. The resulting analysis would involve a simple combination of their methodology with the techniques proposed in section 3.

References

Albert, J.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association, 88.

Agresti, A. (1990). Categorical Data Analysis, John Wiley, New York.

Aranda-Ordaz, F.J. (1981). On two families of transformations to additivity for binary response data, Biometrika, 68.

Basu, S. and Mukhopadhyay, S. (1998). Binary response regression with normal scale mixture links. In Generalized Linear Models: A Bayesian Perspective, D.K. Dey et al. (eds.), Marcel Dekker, New York.

Chen, M-H. and Dey, D.K. (1998). Bayesian modeling of correlated binary responses via scale mixture of multivariate normal link functions, Sankhyā, 60, 322.

Dagpunar, J. (1988). Principles of Random Variate Generation, Oxford University Press, Oxford.

Dalal, S.R., Fowlkes, E.B. and Hoadley, B. (1989). Risk analysis of the space shuttle: pre-Challenger prediction of failure, Journal of the American Statistical Association, 84.

Devroye, L. (1986). Non-Uniform Random Variate Generation, Springer-Verlag, New York.

DiCiccio, T.J., Kass, R.E., Raftery, A. and Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations, Journal of the American Statistical Association, 92.

Geisser, S. and Eddy, W. (1979). A predictive approach to model selection, Journal of the American Statistical Association, 74.

Gelfand, A.E. (1996). Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice (W.R. Gilks, S. Richardson and D.J. Spiegelhalter, eds.), Chapman and Hall, London.

Gelfand, A.E., Dey, D.K. and Chang, H. (1992). Model determination using predictive distributions with implementations via sampling-based methods. In Bayesian Statistics 4, J.M. Bernardo et al. (eds.), Oxford University Press, Oxford.

Gelfand, A.E. and Ghosh, S.K. (1998). Model choice: a minimum posterior predictive loss approach, Biometrika, 85.

Gilks, W.R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling, Applied Statistics, 41.

Guerrero, V.M. and Johnson, R. (1982). Use of the Box-Cox transformation with binary response models, Biometrika, 69.

Lavine, M. (1991). Problems in extrapolation illustrated with space shuttle o-ring data, Journal of the American Statistical Association, 86.

Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates, Annals of Statistics, 12.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed., Chapman and Hall, London.

Prentice, R.L. (1976). A generalization of the probit and logit models for dose response curves, Biometrics, 32.

Rubin, P.A. (1984). Generating random points in a polytope, Communications in Statistics - Simulation and Computation, 13.

Schmeiser, B.W. (1980). Generation of variates from distribution tails, Operations Research, 28.

Smith, R.L. (1984). Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions, Operations Research, 32.

Stukel, T.A. (1988). Generalized logistic models, Journal of the American Statistical Association, 83.

Sanjib Basu
Division of Statistics
Northern Illinois University
DeKalb, IL, USA

Saurabh Mukhopadhyay
Merck Research Laboratories
P.O. Box 2000, RY
Rahway, New Jersey, USA
saurabh


Theory and Methods of Statistical Inference PhD School in Statistics cycle XXIX, 2014 Theory and Methods of Statistical Inference Instructors: B. Liseo, L. Pace, A. Salvan (course coordinator), N. Sartori, A. Tancredi, L. Ventura Syllabus Some prerequisites:

More information

Bayesian time series classification

Bayesian time series classification Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

More information

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 15-7th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Mixture and composition of kernels. Hybrid algorithms. Examples Overview

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Charles E. McCulloch Biometrics Unit and Statistics Center Cornell University

Charles E. McCulloch Biometrics Unit and Statistics Center Cornell University A SURVEY OF VARIANCE COMPONENTS ESTIMATION FROM BINARY DATA by Charles E. McCulloch Biometrics Unit and Statistics Center Cornell University BU-1211-M May 1993 ABSTRACT The basic problem of variance components

More information

Reconstruction of individual patient data for meta analysis via Bayesian approach

Reconstruction of individual patient data for meta analysis via Bayesian approach Reconstruction of individual patient data for meta analysis via Bayesian approach Yusuke Yamaguchi, Wataru Sakamoto and Shingo Shirahata Graduate School of Engineering Science, Osaka University Masashi

More information

Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints

Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints Yifang Li Department of Statistics, North Carolina State University 2311

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

arxiv: v1 [stat.co] 18 Feb 2012

arxiv: v1 [stat.co] 18 Feb 2012 A LEVEL-SET HIT-AND-RUN SAMPLER FOR QUASI-CONCAVE DISTRIBUTIONS Dean Foster and Shane T. Jensen arxiv:1202.4094v1 [stat.co] 18 Feb 2012 Department of Statistics The Wharton School University of Pennsylvania

More information

A general mixed model approach for spatio-temporal regression data

A general mixed model approach for spatio-temporal regression data A general mixed model approach for spatio-temporal regression data Thomas Kneib, Ludwig Fahrmeir & Stefan Lang Department of Statistics, Ludwig-Maximilians-University Munich 1. Spatio-temporal regression

More information

Nonparametric Bayesian modeling for dynamic ordinal regression relationships

Nonparametric Bayesian modeling for dynamic ordinal regression relationships Nonparametric Bayesian modeling for dynamic ordinal regression relationships Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz Joint work with Maria

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

22s:152 Applied Linear Regression. Chapter 2: Regression Analysis. a class of statistical methods for

22s:152 Applied Linear Regression. Chapter 2: Regression Analysis. a class of statistical methods for 22s:152 Applied Linear Regression Chapter 2: Regression Analysis Regression analysis a class of statistical methods for studying relationships between variables that can be measured e.g. predicting blood

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Bayesian inference for multivariate skew-normal and skew-t distributions

Bayesian inference for multivariate skew-normal and skew-t distributions Bayesian inference for multivariate skew-normal and skew-t distributions Brunero Liseo Sapienza Università di Roma Banff, May 2013 Outline Joint research with Antonio Parisi (Roma Tor Vergata) 1. Inferential

More information

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS M. Gasparini and J. Eisele 2 Politecnico di Torino, Torino, Italy; mauro.gasparini@polito.it

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

A model of skew item response theory

A model of skew item response theory 1 A model of skew item response theory Jorge Luis Bazán, Heleno Bolfarine, Marcia D Ellia Branco Department of Statistics University of So Paulo Brazil ISBA 2004 May 23-27, Via del Mar, Chile 2 Motivation

More information

Kobe University Repository : Kernel

Kobe University Repository : Kernel Kobe University Repository : Kernel タイトル Title 著者 Author(s) 掲載誌 巻号 ページ Citation 刊行日 Issue date 資源タイプ Resource Type 版区分 Resource Version 権利 Rights DOI URL Note on the Sampling Distribution for the Metropolis-

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

BAYESIAN ANALYSIS OF CORRELATED PROPORTIONS

BAYESIAN ANALYSIS OF CORRELATED PROPORTIONS Sankhyā : The Indian Journal of Statistics 2001, Volume 63, Series B, Pt. 3, pp 270-285 BAYESIAN ANALYSIS OF CORRELATED PROPORTIONS By MARIA KATERI, University of Ioannina TAKIS PAPAIOANNOU University

More information

Generalized common spatial factor model

Generalized common spatial factor model Biostatistics (2003), 4, 4,pp. 569 582 Printed in Great Britain Generalized common spatial factor model FUJUN WANG Eli Lilly and Company, Indianapolis, IN 46285, USA MELANIE M. WALL Division of Biostatistics,

More information

Rank Regression with Normal Residuals using the Gibbs Sampler

Rank Regression with Normal Residuals using the Gibbs Sampler Rank Regression with Normal Residuals using the Gibbs Sampler Stephen P Smith email: hucklebird@aol.com, 2018 Abstract Yu (2000) described the use of the Gibbs sampler to estimate regression parameters

More information

A nonparametric Bayesian approach to inference for non-homogeneous. Poisson processes. Athanasios Kottas 1. (REVISED VERSION August 23, 2006)

A nonparametric Bayesian approach to inference for non-homogeneous. Poisson processes. Athanasios Kottas 1. (REVISED VERSION August 23, 2006) A nonparametric Bayesian approach to inference for non-homogeneous Poisson processes Athanasios Kottas 1 Department of Applied Mathematics and Statistics, Baskin School of Engineering, University of California,

More information

D-optimal Designs for Factorial Experiments under Generalized Linear Models

D-optimal Designs for Factorial Experiments under Generalized Linear Models D-optimal Designs for Factorial Experiments under Generalized Linear Models Jie Yang Department of Mathematics, Statistics, and Computer Science University of Illinois at Chicago Joint research with Abhyuday

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

On a multivariate implementation of the Gibbs sampler

On a multivariate implementation of the Gibbs sampler Note On a multivariate implementation of the Gibbs sampler LA García-Cortés, D Sorensen* National Institute of Animal Science, Research Center Foulum, PB 39, DK-8830 Tjele, Denmark (Received 2 August 1995;

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Statistics & Data Sciences: First Year Prelim Exam May 2018

Statistics & Data Sciences: First Year Prelim Exam May 2018 Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book

More information

Bayesian Analysis for Step-Stress Accelerated Life Testing using Weibull Proportional Hazard Model

Bayesian Analysis for Step-Stress Accelerated Life Testing using Weibull Proportional Hazard Model Noname manuscript No. (will be inserted by the editor) Bayesian Analysis for Step-Stress Accelerated Life Testing using Weibull Proportional Hazard Model Naijun Sha Rong Pan Received: date / Accepted:

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information