Lecture 13: More on Binary Data

Link functions for Binomial models

Link                      $\eta = g(\pi)$               $\pi = g^{-1}(\eta)$
identity                  $\pi$                         $\eta$
logarithmic               $\log \pi$                    $e^{\eta}$
logistic                  $\log\{\pi/(1-\pi)\}$         $e^{\eta}/(1+e^{\eta})$
probit                    $\Phi^{-1}(\pi)$              $\Phi(\eta)$
log-log                   $\log(-\log \pi)$             $\exp(-e^{\eta})$
complementary log-log     $\log(-\log(1-\pi))$          $1-\exp(-e^{\eta})$

[Figure: Comparison of Link Functions. g(p) plotted against p on [0, 1] for the logit, probit, and complementary log-log (CLL) links.]

When g is the identity or logarithmic function, $\hat{\pi} = g^{-1}(\hat{\eta})$ may lie outside the interval [0, 1]. Therefore, these link functions may not be the best choices.

The logistic and probit links are by far the most commonly used.
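The three links shown in the comparison figure can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the names `LINKS` and `link_table` are illustrative choices and do not come from the lecture.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # N(0, 1); provides cdf and inv_cdf

# Each entry maps a link name to the pair (g, g_inverse).
LINKS = {
    "logit": (
        lambda p: math.log(p / (1.0 - p)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (
        lambda p: math.log(-math.log(1.0 - p)),
        lambda eta: 1.0 - math.exp(-math.exp(eta)),
    ),
}


def link_table(probs):
    """Return {link name: [g(p) for p in probs]} for every link in LINKS."""
    return {name: [g(p) for p in probs] for name, (g, _) in LINKS.items()}


# Evaluate each link at a few probabilities.
for name, values in link_table([0.1, 0.5, 0.9]).items():
    print(name, [round(v, 3) for v in values])
```

Comparing g(0.1) with g(0.9) for each link shows the symmetry of the logit and probit about π = 1/2 and the asymmetry of the complementary log-log link.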

The probit model requires numerical integration in the computation of the MLEs (since Φ does not have a closed form).

Both the logistic and probit links are symmetric about π = 1/2, and produce very similar results in most analyses (e.g., the beetle example) unless there are many small or large probabilities. A very large amount of data would be required to show that one was better than the other.

The logistic, probit, and complementary log-log links are similar for small π.

The complementary log-log link is skewed (asymmetric).

Advantages of the logistic model

The logistic model provides a direct interpretation in terms of the log-odds of a success. This interpretation is particularly nice in the context of case-control studies. Here, the exposure level (e.g., of a toxin) is compared between individuals who have a particular disease or condition (the "cases") and those who do not have the disease (the "controls").

The logit link is the canonical link for the binomial distribution, and hence is mathematically convenient.

Interpretation of the link functions: bioassays and dose-response models

A biological assay is a method for estimating the potency of a material by means of the reaction which follows its application to living matter.

An m-point assay is an assay where a lethal drug is administered in m doses $d_1, \ldots, d_m$, where (usually) $d_i = \log(\text{concentration})$.

In the experiment, $n_i$ subjects get dose $d_i$, and $y_i$ of these subjects die.

We model $Y_i \sim \text{Binomial}(n_i, \pi_i)$, where $\pi_i = P(\text{death given dose } i)$ is a function of $d_i$.
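As a concrete illustration of the dose-response model above, the following minimal sketch evaluates the binomial log-likelihood for an m-point assay under a logistic link, and checks the log-odds interpretation of the slope. The function names and the three-dose data set are hypothetical and chosen only for illustration; they are not from the lecture.

```python
import math


def logistic(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds of success corresponding to probability p."""
    return p / (1.0 - p)


def binomial_loglik(beta0, beta1, doses, n, y):
    """Log-likelihood of Y_i ~ Binomial(n_i, pi_i), logit(pi_i) = beta0 + beta1 * d_i."""
    total = 0.0
    for d_i, n_i, y_i in zip(doses, n, y):
        pi_i = logistic(beta0 + beta1 * d_i)
        total += (
            math.log(math.comb(n_i, y_i))
            + y_i * math.log(pi_i)
            + (n_i - y_i) * math.log(1.0 - pi_i)
        )
    return total


# Hypothetical 3-point assay (illustrative numbers only).
doses = [0.0, 1.0, 2.0]
n = [10, 10, 10]
y = [2, 5, 8]

print(binomial_loglik(-1.4, 1.4, doses, n, y))

# Under the logit link, a unit increase in dose multiplies the odds by exp(beta1).
ratio = odds(logistic(-1.4 + 1.4 * 2.0)) / odds(logistic(-1.4 + 1.4 * 1.0))
print(ratio, math.exp(1.4))
```

Maximizing `binomial_loglik` over (beta0, beta1) would give the MLEs; the sketch only evaluates the function at a single trial value.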

Tolerance Distribution

We think of different subjects as having different tolerances to a drug. Define $D$ to be the minimum dose required to produce a response in a subject. Then $D$ is a random variable. Its distribution is called the tolerance distribution, and its cdf is denoted by $F_D$.

An animal is killed by dose $d_i$ if and only if its tolerance is less than or equal to $d_i$, i.e., $D \le d_i$. Thus, the probability $\pi_i$ of death at dose $d_i$ is given by
\[ \pi_i = P(D \le d_i) = F_D(d_i). \]

Probit Analysis

Based on the assumption of a normal tolerance distribution, i.e., $D \sim N(\mu_d, \sigma_d^2)$, we have
\[ \pi_i = P(D \le d_i) = \Phi\left(\frac{d_i - \mu_d}{\sigma_d}\right), \]
where $\Phi$ is the $N(0, 1)$ cdf. Thus,
\[ \Phi^{-1}(\pi_i) = \frac{d_i - \mu_d}{\sigma_d} = \beta_0 + \beta_1 d_i, \]
where $\beta_0 = -\mu_d/\sigma_d$ and $\beta_1 = 1/\sigma_d$.

Hence, the mean and standard deviation of the tolerance distribution can be estimated via the regression parameters in this model.

Logistic analysis

Assume now that the tolerance distribution is the logistic distribution, i.e.,
\[ f_D(d) = \frac{\exp\{(d - \mu_d)/\tau\}}{\tau\,[1 + \exp\{(d - \mu_d)/\tau\}]^2}, \]
where $-\infty < d < \infty$, $-\infty < \mu_d < \infty$, and $\tau > 0$. Then $E[D] = \mu_d$ and $\mathrm{Var}[D] = \pi^2\tau^2/3$ (where, in this variance formula, $\pi = 3.1415\ldots$ is the mathematical constant).

Under this assumption,
\[ \pi_i = F_D(d_i) = \frac{\exp\{(d_i - \mu_d)/\tau\}}{1 + \exp\{(d_i - \mu_d)/\tau\}}. \]
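The tolerance-distribution view described so far can be sketched in a few lines: the probability of death at a dose is the tolerance cdf evaluated at that dose, and for the normal case the regression coefficients can be inverted to recover the tolerance mean and standard deviation. This is a minimal sketch; the function names and the numerical values used in the usage lines are hypothetical, not estimates from the lecture.

```python
import math
from statistics import NormalDist


def death_prob_normal(d, mu_d, sigma_d):
    """pi = P(D <= d) when the tolerance D is N(mu_d, sigma_d^2)."""
    return NormalDist(mu_d, sigma_d).cdf(d)


def death_prob_logistic(d, mu_d, tau):
    """pi = P(D <= d) when the tolerance D is logistic with location mu_d, scale tau."""
    return 1.0 / (1.0 + math.exp(-(d - mu_d) / tau))


def probit_coefs_to_tolerance(beta0, beta1):
    """Invert beta0 = -mu_d / sigma_d and beta1 = 1 / sigma_d."""
    sigma_d = 1.0 / beta1
    mu_d = -beta0 / beta1
    return mu_d, sigma_d


# Hypothetical probit coefficients (illustrative only).
mu_d, sigma_d = probit_coefs_to_tolerance(-9.0, 5.0)
print(mu_d, sigma_d)

# At the mean tolerance, half of the subjects are expected to respond.
print(death_prob_normal(mu_d, mu_d, sigma_d))
print(death_prob_logistic(mu_d, mu_d, 0.1))
```

The same inversion applies to the logistic case with τ in place of σ_d, since the logistic cdf depends on the dose only through (d − µ_d)/τ.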

Under the logistic tolerance distribution, the logit of $\pi_i$ is linear in the dose:
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \frac{d_i - \mu_d}{\tau} = \beta_0 + \beta_1 d_i, \]
where $\beta_0 = -\mu_d/\tau$ and $\beta_1 = 1/\tau$.

Complementary log-log analysis

Similarly, the assumption of the extreme-value tolerance distribution
\[ F_D(d_i) = 1 - \exp\left\{-\exp\left(\frac{a - d_i}{b}\right)\right\} \]
(where $b < 0$) leads to the complementary log-log model.

Example: Beetle data, cont.

Goal: Find the tolerance distribution.

We chose a linear complementary log-log model for the probability $\pi$ of death at a given dose $x$. This implies that the tolerance distribution is the extreme value distribution. Therefore,
\[ F_D(x) = \pi = 1 - \exp[-\exp(\beta_0 + \beta_1 x)]. \]
The density function of $D$ is given by
\[ f(x) = \beta_1 \exp[(\beta_0 + \beta_1 x) - \exp(\beta_0 + \beta_1 x)]. \]
An estimate of this density can be obtained by substituting the estimated values $\hat{\beta}_0$ and $\hat{\beta}_1$:
\[ \hat{f}(x) = 0.1525 \exp[(-9.60 + 0.1525x) - \exp(-9.60 + 0.1525x)]. \]

Probit-logistic relationship

The normal and logistic distributions are often quite similar (for certain values of their parameters). Consider the case where the true tolerance distribution is similar to these distributions. In this case, we would expect the GLMs with probit and logistic links to fit almost equally well.

More specifically, if we construct logistic and normal tolerance distributions with the same mean and variance (so that these distributions are very similar), we obtain an approximate relationship between the probit and logistic models.

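[Figure: Estimated Tolerance Distribution (Extreme Value Density). Estimated density f(x) plotted against x, for x from 40 to 80.]

Equating the variances of the normal and logistic tolerance distributions gives
\[ \sigma^2 = \frac{\pi^2\tau^2}{3} \quad\Longrightarrow\quad \sigma = \frac{\pi\tau}{\sqrt{3}}, \]
where $\pi$ here is the mathematical constant. Thus,
\[ \mathrm{probit}(\pi) = \frac{d - \mu_d}{\sigma} = \frac{\sqrt{3}}{\pi}\left(\frac{d - \mu_d}{\tau}\right) = \frac{\sqrt{3}}{\pi}\,\mathrm{logit}(\pi) \approx 0.55\,\mathrm{logit}(\pi), \]
where the argument of probit and logit is the probability $\pi$ and the factor $\sqrt{3}/\pi$ uses the constant.

If we denote the linear predictor under the probit model by $\eta_i = \sum_{j=1}^{p} x_{ij}\beta_j$ and the linear predictor under the logit model by $\eta_i^* = \sum_{j=1}^{p} x_{ij}\beta_j^*$, then
\[ \eta_i \approx 0.55\,\eta_i^* \]
\[ \Longrightarrow\quad \sum_{j=1}^{p} x_{ij}\beta_j \approx 0.55 \sum_{j=1}^{p} x_{ij}\beta_j^* \quad \text{(for all } x_{ij}\text{)} \]
\[ \Longrightarrow\quad \beta_j \approx 0.55\,\beta_j^* \quad \text{(for all } j\text{)}. \]

In this case, we see that the regression coefficients estimated under the probit model are approximately 0.55 times those estimated under the logit model.

Thus, typically we choose between the logit and probit model based on considerations such as interpretation rather than goodness-of-fit.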
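The variance-matching factor can be checked numerically. The sketch below computes √3/π and compares probit(p) with that factor times logit(p) at a few probabilities; the names `SCALE` and `compare` are illustrative and not from the lecture. The relationship is approximate, so the two columns agree only roughly.

```python
import math
from statistics import NormalDist

# Variance-matching factor sqrt(3)/pi, with pi the mathematical constant.
SCALE = math.sqrt(3.0) / math.pi


def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Standard normal quantile of probability p."""
    return NormalDist().inv_cdf(p)


def compare(probs):
    """Return (p, probit(p), SCALE * logit(p)) for each p in probs."""
    return [(p, probit(p), SCALE * logit(p)) for p in probs]


print(f"sqrt(3)/pi = {SCALE:.4f}")
for p, exact, approx in compare([0.1, 0.25, 0.5, 0.75, 0.9]):
    print(f"p={p:.2f}  probit={exact:+.3f}  scaled logit={approx:+.3f}")
```

The same factor applies coefficient by coefficient, so a fitted logit slope multiplied by `SCALE` gives a rough guess at the corresponding probit slope.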