Latent Variable Models for Binary Data

Suppose that for a given vector of explanatory variables $x$, the latent variable $U$ has a continuous cumulative distribution function $F(u; x)$, and that the binary response $Y = 1$ is recorded if and only if $U > 0$:
$$\theta = \Pr(Y = 1 \mid x) = 1 - F(0; x).$$
Since $U$ is not directly observed, there is no loss of generality in taking the critical value (i.e., cutoff point) to be 0. In addition, we can take the standard deviation of $U$ (or some other measure of dispersion) to be 1, without loss of generality.
Probit Models

For example, if $U \sim N(x'\beta, 1)$, it follows that
$$\theta_i = \Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta),$$
where $\Phi(\cdot)$ is the cumulative normal distribution function,
$$\Phi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp\left(-\tfrac{1}{2} z^2\right) dz.$$
The relation is linearized by the inverse normal transformation,
$$\Phi^{-1}(\theta_i) = x_i'\beta = \sum_{j=1}^{p} x_{ij} \beta_j.$$
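A minimal numerical sketch of the probit link and its linearizing inverse, using SciPy rather than the course's S-PLUS; the coefficient and covariate values are illustrative, not from the notes:

```python
import numpy as np
from scipy.stats import norm

# Illustrative design vector (with intercept) and assumed coefficients.
beta = np.array([-1.0, 0.5])
x = np.array([1.0, 3.0])

eta = x @ beta           # linear predictor x'beta = 0.5
theta = norm.cdf(eta)    # probit link: Pr(Y = 1 | x) = Phi(x'beta)

# The inverse normal (probit) transform recovers the linear predictor.
assert np.isclose(norm.ppf(theta), eta)
```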
We have regarded the cutoff value of $U$ as fixed and the mean of $U$ as changing with $x$. Alternatively, one could assume that the distribution of $U$ is fixed and allow the critical value to vary with $x$ (e.g., dose). In toxicology studies where dose is the explanatory variable, it makes sense to let $V$ denote the minimum dose level needed to produce a response (i.e., the tolerance).
Under the second formulation, $y_i = 1$ if $x_i'\beta > v_i$. It follows that $\Pr(Y = 1 \mid x_i) = \Pr(V \le x_i'\beta)$. Note that the shape of the dose-response curve is determined by the distribution function of $V$. If $V \sim N(0, 1)$, then $\Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, and it follows that the $U$ and $V$ formulations are equivalent. The $U$ formulation is more common.
Latent Utilities & Choice Models

Suppose that Fred is choosing between two brands of a product (say, Ben & Jerry's or Häagen-Dazs). Fred has a utility for Ben & Jerry's (denoted by $Z_{i1}$) and a utility for Häagen-Dazs (denoted by $Z_{i2}$). Letting the difference in utilities be represented by the normal linear model, we have
$$U_i = Z_{i1} - Z_{i2} = x_i'\beta + \epsilon_i, \quad \epsilon_i \sim N(0, 1).$$
If Fred has a higher utility for Ben & Jerry's, then $Z_{i1} > Z_{i2}$, $U_i > 0$, and Fred will choose Ben & Jerry's ($Y_i = 1$).
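The latent-utility story can be checked by simulation: drawing utility differences $U_i = x_i'\beta + \epsilon_i$ and recording the choice frequency reproduces the probit probability $\Phi(x_i'\beta)$. A sketch with made-up coefficient values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta = np.array([0.2, 0.4])   # illustrative coefficients, not from the notes
x = np.array([1.0, 1.0])
n_sims = 200_000

# Difference in latent utilities: U = Z1 - Z2 = x'beta + eps, eps ~ N(0, 1).
u = x @ beta + rng.standard_normal(n_sims)
choice_freq = np.mean(u > 0)  # fraction of simulations choosing brand 1 (Y = 1)

# Both values are close to Phi(0.6), approximately 0.726.
print(choice_freq, norm.cdf(x @ beta))
```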
This latent utility formulation is again equivalent to a probit model for the binary response. The generalization to a multinomial response is straightforward: introduce $k$ latent utilities instead of 2, and let an individual's response (i.e., choice) correspond to the category with the maximum utility. Although the probit model is preferred in bioassay and social science applications, the logistic model is preferred in the biomedical sciences. Of course, the choice of distribution function for $U$ (and hence the choice of link in the binary response GLM) should be motivated by model fit.
Logistic Regression

The normal form is only one possibility for the distribution of $U$. Another is the logistic distribution with location $x_i'\beta$ and unit scale, which has cumulative distribution function
$$F(u) = \frac{\exp(u - x_i'\beta)}{1 + \exp(u - x_i'\beta)},$$
so that
$$F(0; x_i) = 1/\{1 + \exp(x_i'\beta)\}.$$
It follows that
$$\Pr(Y = 1 \mid x_i) = \Pr(U > 0 \mid x_i) = 1 - F(0; x_i) = 1/\{1 + \exp(-x_i'\beta)\}.$$
To linearize this relation, we take the logit transformation of both sides,
$$\log\{\theta_i/(1 - \theta_i)\} = x_i'\beta.$$
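The logit transformation and its inverse can be sketched directly (function names here are our own, not from any particular library):

```python
import numpy as np

def expit(eta):
    """Inverse logit: Pr(Y = 1 | x) = 1 / (1 + exp(-x'beta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def logit(theta):
    """Logit transform: log{theta / (1 - theta)}."""
    return np.log(theta / (1.0 - theta))

eta = 0.8                 # hypothetical linear predictor x'beta
theta = expit(eta)
assert np.isclose(logit(theta), eta)  # the logit linearizes the relation
```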
Homework Exercise: For $x_i = (1, x)'$ and $\beta_2 > 0$, reformulate the logistic regression in terms of a threshold model (i.e., the $V$ formulation of the probit model described above). Derive the probability density function (pdf) obtained by differentiating $\Pr(Y = 1 \mid x_i)$ with respect to $x$. Reparameterize in terms of $\tau = 1/\beta_2$ and $\mu = \beta_1/\beta_2$. Plot this pdf for $\mu = 0$ and $\pi\tau/\sqrt{3} = 1$ along with the N(0, 1) pdf in S-PLUS. Which density has the fatter tails? Is the pdf for $x$ in the logistic case in the exponential family?
Some Generalizations of the Logistic Model

The logistic regression model assumes a restricted dose-response shape, and it is possible to generalize the model to relax this restriction. Aranda-Ordaz (1981) proposed two families of linearizing transformations, which can easily be inverted and which span a range of forms. The first, which is restricted to symmetric cases (i.e., invariant to interchanging success & failure), is
$$\frac{2}{\nu} \cdot \frac{\theta^\nu - (1 - \theta)^\nu}{\theta^\nu + (1 - \theta)^\nu}.$$
In the limit as $\nu \to 0$ this is logistic, and for $\nu = 1$ it is linear. The second family is
$$\log[\{(1 - \theta)^{-\nu} - 1\}/\nu],$$
which reduces to the extreme value model when $\nu = 0$ and the logistic when $\nu = 1$.
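The limiting forms of the two Aranda-Ordaz families can be verified numerically; this sketch checks that the symmetric family approaches the logit as $\nu \to 0$ and is linear in $\theta$ at $\nu = 1$, while the asymmetric family gives the logit at $\nu = 1$ and the complementary log-log (extreme value) link as $\nu \to 0$:

```python
import numpy as np

def ao_symmetric(theta, nu):
    """Symmetric family: (2/nu) * (theta^nu - (1-theta)^nu) / (theta^nu + (1-theta)^nu)."""
    return (2.0 / nu) * (theta**nu - (1 - theta)**nu) / (theta**nu + (1 - theta)**nu)

def ao_asymmetric(theta, nu):
    """Asymmetric family: log[{(1-theta)^(-nu) - 1} / nu]."""
    return np.log(((1 - theta)**(-nu) - 1) / nu)

theta = 0.7
logit = np.log(theta / (1 - theta))
cloglog = np.log(-np.log(1 - theta))  # extreme value (complementary log-log)

assert np.isclose(ao_symmetric(theta, 1e-8), logit)                 # nu -> 0: logistic
assert np.isclose(ao_symmetric(theta, 1.0), 4*theta - 2)            # nu = 1: linear
assert np.isclose(ao_asymmetric(theta, 1.0), logit)                 # nu = 1: logistic
assert np.isclose(ao_asymmetric(theta, 1e-8), cloglog, atol=1e-6)   # nu -> 0: extreme value
```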
When there is doubt about the transformation, a formal approach is to use one or the other of the above transformations and to fit the resulting model for a range of possible values of $\nu$. A profile likelihood can be obtained for $\nu$ by plotting the maximized likelihood against $\nu$ (frequentist). Potentially, one could choose a standard form, such as the logistic, if the corresponding value of $\nu$ falls within the 95% profile likelihood confidence region. Alternatively, we could choose a prior density for $\nu$ and implement a Bayesian approach.
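A profile likelihood for $\nu$ can be sketched by maximizing the Bernoulli likelihood in $\beta$ at each grid value of $\nu$. This uses the asymmetric family, whose inverse link is $\theta = 1 - (1 + \nu e^\eta)^{-1/\nu}$; the simulated data, grid, and function names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def inv_ao_asymmetric(eta, nu):
    """Inverse of the Aranda-Ordaz asymmetric link: theta = 1 - (1 + nu*exp(eta))^(-1/nu)."""
    return 1.0 - (1.0 + nu * np.exp(eta)) ** (-1.0 / nu)

def profile_loglik(nu, X, y):
    """Maximized Bernoulli log-likelihood with nu held fixed."""
    def negloglik(beta):
        theta = np.clip(inv_ao_asymmetric(X @ beta, nu), 1e-10, 1 - 1e-10)
        return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
    fit = minimize(negloglik, x0=np.zeros(X.shape[1]), method="BFGS")
    return -fit.fun

# Simulated dose-response data with logistic truth (i.e., nu = 1).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.uniform(-2, 2, 300)])
y = (rng.uniform(size=300) < 1/(1 + np.exp(-X @ np.array([0.0, 1.5])))).astype(int)

nu_grid = [0.25, 0.5, 1.0, 2.0, 4.0]
profile = [profile_loglik(nu, X, y) for nu in nu_grid]  # plot against nu_grid
```

Plotting `profile` against `nu_grid` gives the profile likelihood; values of $\nu$ whose maximized log-likelihood is within about 1.92 (half the $\chi^2_1$ 95% quantile) of the overall maximum form the 95% profile region.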
Data Augmentation Algorithms for Probit Models (Albert & Chib, 1993, JASA, 669-679)

Now, suppose that $p_i = \Pr(y_i = 1 \mid x_i, \beta) = \Phi(x_i'\beta)$, where $\Phi(\cdot)$ is the N(0, 1) cdf. As discussed previously, this probit regression model is equivalent to
$$y_i = 1(z_i > 0), \quad z_i \sim N(x_i'\beta, 1),$$
where $z_1, \ldots, z_n$ are independent latent variables. Note that if the $z_i$ are known and a multivariate normal prior is chosen for $\beta$, the posterior distribution is multivariate normal.
The $z_i$ are unknown latent variables, which we introduce for computational convenience and which have no impact on the model interpretation. By introducing the $z_i$'s, we are augmenting the observed data $y = (y_1, \ldots, y_n)'$ with latent data $z = (z_1, \ldots, z_n)'$. The joint posterior density of the unobservables $\beta$ and $z$ is
$$\pi(\beta, z \mid y) \propto \pi(\beta) \prod_{i=1}^{n} \{1(z_i > 0)\,1(y_i = 1) + 1(z_i \le 0)\,1(y_i = 0)\}\, N(z_i; x_i'\beta, 1),$$
which is the prior for $\beta$, times the prior for $z$ given $\beta$, times the likelihood for $y$ given $z$ and $\beta$.
Note that, integrating out the latent data, we have
$$\pi(\beta \mid y) = \int \pi(\beta, z \mid y)\, dz \propto \pi(\beta) \prod_{i=1}^{n} \left\{\int_{-\infty}^{0} N(z_i; x_i'\beta, 1)\, dz_i\right\}^{1(y_i = 0)} \left\{\int_{0}^{\infty} N(z_i; x_i'\beta, 1)\, dz_i\right\}^{1(y_i = 1)}$$
$$= \pi(\beta) \prod_{i=1}^{n} \Phi(-x_i'\beta)^{1(y_i = 0)}\, \Phi(x_i'\beta)^{1(y_i = 1)} = \pi(\beta) \prod_{i=1}^{n} \{1 - \Phi(x_i'\beta)\}^{1(y_i = 0)}\, \Phi(x_i'\beta)^{1(y_i = 1)},$$
and computation could proceed for this binary-response GLM using Gibbs sampling with adaptive rejection sampling (ARS) (e.g., in WinBUGS). This procedure updates the parameters one at a time and requires programming of ARS.
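The marginalization step can be checked numerically: integrating the $N(x_i'\beta, 1)$ density over $(0, \infty)$ does give $\Phi(x_i'\beta)$. A quick sketch with an arbitrary value of the linear predictor:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

eta = 0.7  # hypothetical value of x_i'beta

# Pr(z_i > 0) when z_i ~ N(eta, 1): integrate the normal density over (0, inf).
mass_pos, _ = quad(lambda z: norm.pdf(z, loc=eta, scale=1.0), 0, np.inf)

assert np.isclose(mass_pos, norm.cdf(eta))        # contribution when y_i = 1
assert np.isclose(1 - mass_pos, norm.cdf(-eta))   # contribution when y_i = 0
```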
Alternative: Data Augmentation Gibbs Sampler

1. Choose initial values for $\beta$ and a prior density, $\beta \sim N(\beta_0, \Sigma_0)$.
2. Impute the latent data by sampling from the full conditional distribution,
$$\pi(z_i \mid \beta, y) \propto \{1(z_i > 0)\,1(y_i = 1) + 1(z_i \le 0)\,1(y_i = 0)\}\, N(z_i; x_i'\beta, 1),$$
which is a $N(x_i'\beta, 1)$ distribution truncated to $z_i > 0$ if $y_i = 1$, and to $z_i \le 0$ if $y_i = 0$.
3. Update $\beta$ (jointly!) by sampling from the full conditional, $\pi(\beta \mid z, y) \overset{d}{=} N(\hat\beta, \Sigma_\beta)$, where $\Sigma_\beta = (\Sigma_0^{-1} + X'X)^{-1}$ is the posterior covariance conditional on $z$, and $\hat\beta = \Sigma_\beta (\Sigma_0^{-1}\beta_0 + X'z)$.
4. Repeat steps 2-3 until apparent convergence, and calculate posterior summaries for $\beta$ based on a large number of additional iterates.
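The steps above can be sketched in a few lines of Python (NumPy/SciPy rather than the course's S-PLUS/WinBUGS); this is a minimal illustration of the Albert and Chib scheme, with no tuning, blocking, or convergence diagnostics:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, beta0, Sigma0, n_iter=2000, seed=0):
    """Data-augmentation Gibbs sampler for probit regression under a
    N(beta0, Sigma0) prior (Albert & Chib, 1993). A minimal sketch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_beta = np.linalg.inv(Sigma0_inv + X.T @ X)  # posterior covariance given z
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # Step 2: impute z_i ~ N(x_i'beta, 1) truncated to (0, inf) if y_i = 1,
        # and to (-inf, 0] if y_i = 0 (bounds standardized since z - eta ~ N(0, 1)).
        eta = X @ beta
        lo = np.where(y == 1, -eta, -np.inf)
        hi = np.where(y == 1, np.inf, -eta)
        z = eta + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 3: update beta jointly from N(beta_hat, Sigma_beta).
        beta_hat = Sigma_beta @ (Sigma0_inv @ beta0 + X.T @ z)
        beta = rng.multivariate_normal(beta_hat, Sigma_beta)
        draws[t] = beta
    return draws
```

In practice one would discard an initial burn-in portion of `draws` and summarize the rest, as in step 4.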
Some Comments

If you like probit models, the Albert and Chib algorithm is extremely useful, being very easy to program and efficient relative to ARS. Probit models have the disadvantage that the regression coefficients cannot be expressed as a simple analytic function of the probability of response. However, by approximating the logistic model using a scale mixture of normals, one can modify the Albert and Chib approach for logistic regression (O'Brien and Dunson, 2003, ISDS Discussion Paper 03-08). Underlying normal models are not limited to univariate binary data; the generalizations are extremely useful.
Extending GLMs for Correlated Data

GLMs assume that the observations $y_1, \ldots, y_n$ are independent draws from an exponential family distribution. However, in many applications there may be dependency in the outcome data. For example, in longitudinal studies, repeated observations are collected for each study subject.
Longitudinal Studies

For subject $i$ ($i = 1, \ldots, n$), the outcome data consist of an $n_i \times 1$ vector of measurements at follow-up times $t_{i,1}, \ldots, t_{i,n_i}$. Instead of a single measurement $y_i$ for subject $i$, we have a vector of measurements, $y_i = (y_{i1}, \ldots, y_{i,n_i})'$. Since different measurements for a subject may be correlated, the standard GLM is not appropriate.
Possibilities for Repeated Measures Data

1. Conditional Model: Allow the linear predictor to differ for the different study subjects, $\eta_{ij} = x_{ij}'\beta + b_i$, where $x_{ij}$ does not include an intercept and $b_i$ is a subject-specific parameter (i.e., subject is a blocking factor, in ANOVA jargon).
2. Marginal Model: Specify a marginal model for the population-averaged response, and construct a variance estimator which takes into account the correlation structure (e.g., generalized estimating equations; Liang and Zeger, 1986).
3. Mixed Model: Assume the regression coefficients for a subject are drawn from a population distribution, and estimate both the population and individual-specific parameters (Laird and Ware, 1982).