Likelihoods for Generalized Linear Models

1 Likelihoods for Generalized Linear Models

1.1 Some General Theory

We assume that $Y_i$ has a p.d.f. that is a member of the exponential family. That is,

$$f(y_i; \theta_i, \phi) = \exp\{(y_i \theta_i - b(\theta_i))/a_i(\phi) + c(y_i; \phi)\}$$

for some specific functions $a_i(\cdot)$, $b(\cdot)$ and $c(\cdot)$, with $\phi$ known. The parameter $\theta$ is called the canonical parameter. The parameter $\phi$, termed the scale parameter, is constant for all $i = 1, \ldots, n$. For now consider $\phi$ to be specified.

Consider the log-likelihood for a single observation from the exponential family,

$$\ell(\theta, \phi; y) = \log f(y; \theta, \phi) = (y\theta - b(\theta))/a(\phi) + c(y; \phi),$$

with score and information

$$S(\theta) = \frac{\partial \ell}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad I(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2} = \frac{b''(\theta)}{a(\phi)}.$$

Consider the following very general results.

Since $f$ is a density, $\int f(y; \theta, \phi)\,dy = 1$, so

$$\frac{\partial}{\partial \theta} \int f(y; \theta, \phi)\,dy = \int \frac{\partial}{\partial \theta} f(y; \theta, \phi)\,dy = 0.$$

But

$$\frac{\partial}{\partial \theta} f(y; \theta, \phi) = \left(\frac{\partial}{\partial \theta} \log f(y; \theta, \phi)\right) f(y; \theta, \phi).$$

Therefore,

$$\int \frac{\partial \log f(y; \theta, \phi)}{\partial \theta}\, f(y; \theta, \phi)\,dy = 0,$$

which in turn implies $E(S(\theta)) = 0$. Differentiating again, we find

$$\int \frac{\partial^2 \log f(y; \theta, \phi)}{\partial \theta^2}\, f(y; \theta, \phi)\,dy + \int \left(\frac{\partial \log f(y; \theta, \phi)}{\partial \theta}\right)^2 f(y; \theta, \phi)\,dy = 0,$$

which implies

$$-E\left(\frac{\partial^2 \log f(y; \theta, \phi)}{\partial \theta^2}\right) = E\left(\left(\frac{\partial \log f(y; \theta, \phi)}{\partial \theta}\right)^2\right).$$

These facts are useful to us for deriving the following results.

Mean: $E(S(\theta)) = 0$ gives $E((Y - b'(\theta))/a(\phi)) = 0$, which in turn implies that

$$E(Y) = b'(\theta) = \mu.$$

The key result here is that $\mu = b'(\theta)$.

Variance: $-E(\partial^2 \log f/\partial \theta^2) = E((\partial \log f/\partial \theta)^2)$ implies

$$E(b''(\theta)/a(\phi)) = E\left(\left(\frac{Y - b'(\theta)}{a(\phi)}\right)^2\right),$$

which in turn implies $b''(\theta)/a(\phi) = \mathrm{Var}(Y)/a(\phi)^2$, and hence

$$\mathrm{Var}(Y) = b''(\theta)\, a(\phi).$$

The variance can thus be written as a product of

- $b''(\theta)$, a function of the canonical parameter, and hence of the mean of the distribution; it is sometimes called the variance function and denoted by $V(\mu)$ when considered as a function of the mean $\mu$. Note that this is not in general the variance of $Y$.
- $a(\phi)$, a function only of $\phi$.
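
These two identities are easy to check by simulation. Below is a minimal R sketch (R is syntax-compatible with the Splus used later in these notes; the variable names here are our own) for the Poisson family, where $b(\theta) = e^\theta$ and $a(\phi) = 1$.

set.seed(1)
theta <- log(4)                       # canonical parameter; lambda = exp(theta) = 4
y <- rpois(1e6, lambda = exp(theta))  # a large Poisson sample
mean(y)   # approximately b'(theta)  = exp(theta) = 4
var(y)    # approximately b''(theta) * a(phi) = exp(theta) = 4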

The function $a(\phi)$ can commonly be written as $a(\phi) = \phi/w$, where $\phi$ is called a dispersion parameter and $w$ is a prior weight for the observation.

Link Functions

The link function relates $\eta$ (the linear predictor) to $\mu$, the expected value of the random variable $Y$.

Canonical Links

Consider the log-likelihood function from the exponential family,

$$\ell(\theta, \phi; y) = (y\theta - b(\theta))/a(\phi) + c(y; \phi).$$

If $y_1, y_2, \ldots, y_n$ is a random sample of $n$ observations we introduce a subscript $i$ and write

$$\ell(\theta, \phi; y) = \sum_{i=1}^n \left[(y_i \theta_i - b(\theta_i))/a_i(\phi) + c(y_i; \phi)\right].$$

If we set

$$\theta_i = \eta_i = x_i'\beta$$

we say we have a canonical link (i.e. canonical parameter = linear predictor), and then

$$\ell(\beta, \phi; y) = \sum_{i=1}^n \left[(y_i x_i'\beta - b(x_i'\beta))/a_i(\phi) + c(y_i; \phi)\right]
= \sum_{j=1}^p \beta_j \left(\sum_{i=1}^n y_i x_{ij}/a_i(\phi)\right) - \sum_{i=1}^n b(x_i'\beta)/a_i(\phi) + \sum_{i=1}^n c(y_i; \phi).$$

If $\phi$ is known, $\sum_{i=1}^n y_i x_{ij}$ is a sufficient statistic for $\beta_j$, or equivalently, $Xy$ is a sufficient statistic for $\beta$. Canonical links are useful in terms of their statistical properties, but context and goodness of fit should motivate the choice of the link. It often turns out that the canonical links are in fact appropriate (e.g. linear regression with normally distributed observations, logistic regression).

a. Normal Distribution

Dropping the subscript $i$, consider a single observation from the $N(\mu, \sigma^2)$ distribution. It has density

$$f(y; \theta, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y - \mu)^2/2\sigma^2\}
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y^2 - 2\mu y + \mu^2)/2\sigma^2\}
= \exp\{(y\mu - \mu^2/2)/\sigma^2 - y^2/2\sigma^2 - \log(2\pi\sigma^2)/2\}.$$

Matching this up with the general expression for the exponential family, we see $\theta = \mu$, $\phi = \sigma^2$, and

$$a(\phi) = \phi, \qquad b(\theta) = \tfrac{1}{2}\theta^2 = \tfrac{1}{2}\mu^2, \qquad c(y; \phi) = -\tfrac{1}{2}(y^2/\phi + \log(2\pi\phi)) = -\tfrac{1}{2}(y^2/\sigma^2 + \log(2\pi\sigma^2)),$$

so that $f(y; \theta, \phi) = \exp\{(y\mu - \mu^2/2)/\sigma^2 - (y^2/\sigma^2 + \log(2\pi\sigma^2))/2\}$. Hence:

mean: $b'(\theta) = \theta = \mu$
variance: $b''(\theta)\, a(\phi) = 1 \cdot \phi = \sigma^2$
variance function: $b''(\theta) = 1 \ \{= V(\mu)\}$
canonical link: $\theta = \mu = \eta$ (identity)

b. Poisson Distribution

$$f(y; \lambda) = \lambda^y e^{-\lambda}/y!, \qquad \text{so} \qquad f(y; \theta, \phi) = \exp\{(y \log \lambda - \lambda) - \log y!\}.$$

Matching these expressions up with the expression for the general exponential family, we see $\theta = \log \lambda$ and $\phi = 1$. Furthermore,

$$a(\phi) = 1, \qquad b(\theta) = \exp\{\theta\} = \lambda, \qquad c(y; \phi) = -\log y!.$$

These give

mean: $b'(\theta) = \exp\{\theta\} = \mu = \lambda$

variance: $b''(\theta)\, a(\phi) = \exp\{\theta\} \cdot 1 = \mu$
variance function: $b''(\theta) = \exp\{\theta\} = \mu \ \{= V(\mu)\}$
canonical link: $\theta = \log \mu = \eta$ (log link)

c. Binomial Distribution

If $V \sim \mathrm{Bin}(m, \pi)$, then

$$f(v; \pi) = \binom{m}{v} \pi^v (1 - \pi)^{m - v}, \qquad v = 0, 1, \ldots, m.$$

Now consider the transformation to $Y = V/m$. Then

$$f(y; \pi) = \binom{m}{my} \pi^{my} (1 - \pi)^{m - my}, \qquad y = 0, 1/m, \ldots, 1.$$

We can write this as

$$f(y; \pi) = \exp\left\{my \log \pi + (m - my) \log(1 - \pi) + \log \binom{m}{my}\right\}
= \exp\left\{my \log(\pi/(1 - \pi)) + m \log(1 - \pi) + \log \binom{m}{my}\right\}
= \exp\left\{\left[y \log(\pi/(1 - \pi)) + \log(1 - \pi)\right]/m^{-1} + \log \binom{m}{my}\right\}.$$

Here we have $\theta = \log(\pi/(1 - \pi))$, $\phi = 1$, $w = m$, and

$$a(\phi) = \phi/w = 1/m, \qquad b(\theta) = \log(1 + \exp\{\theta\}) = -\log(1 - \pi), \qquad c(y; \phi) = \log \binom{m}{my}.$$

mean: $b'(\theta) = \exp\{\theta\}/(1 + \exp\{\theta\}) = \mu = \pi$

variance:

$$b''(\theta)\, a(\phi) = \frac{e^\theta(1 + e^\theta) - e^\theta \cdot e^\theta}{(1 + e^\theta)^2} \cdot \frac{1}{m}
= \left(\frac{e^\theta}{1 + e^\theta}\right)\left(\frac{1}{1 + e^\theta}\right)\left(\frac{1}{m}\right)
= \mu(1 - \mu)/m = \pi(1 - \pi)/m$$

variance function: $b''(\theta) = \mu(1 - \mu) = V(\mu)$
canonical link: $\theta = \log(\mu/(1 - \mu)) = \eta$ (logit link)

1.2 Iteratively Reweighted Least Squares

The Score Vector

This is a way of interpreting the Newton-Raphson algorithm for maximization of the likelihood function. Consider the log-likelihood for a single observation from the exponential family,

$$\ell(\theta, \phi; y) = (y\theta - b(\theta))/a(\phi) + c(y; \phi).$$

Recall:

- $\ell$ is a function of $\theta$ (we initially assume that $\phi$ is known);
- $\mu$ can be expressed in terms of $\theta$ through $\mu = b'(\theta)$;
- $\eta$ can be expressed in terms of $\mu$ through the link function;
- $\eta$ can be expressed in terms of $\beta$ through $\eta = x'\beta$.

To find the MLE of $\beta$, we want to solve $S(\beta) = \partial \ell/\partial \beta = 0$. Consider differentiating with respect to a scalar $\beta_j$. By the chain rule

$$\frac{\partial \ell}{\partial \beta_j} = \frac{\partial \ell}{\partial \theta} \cdot \frac{\partial \theta}{\partial \mu} \cdot \frac{\partial \mu}{\partial \eta} \cdot \frac{\partial \eta}{\partial \beta_j},$$

where

$$\frac{\partial \ell}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad \frac{\partial \theta}{\partial \mu} = \left(\frac{\partial \mu}{\partial \theta}\right)^{-1} = \frac{1}{b''(\theta)}, \qquad \frac{\partial \mu}{\partial \eta} \text{ is determined by the link}, \qquad \frac{\partial \eta}{\partial \beta_j} = x_j.$$

Since $\mu = b'(\theta)$ and $V = b''(\theta)$,

$$\frac{\partial \ell}{\partial \beta_j} = \frac{y - b'(\theta)}{a(\phi)} \cdot \frac{1}{b''(\theta)} \cdot \frac{\partial \mu}{\partial \eta} \cdot x_j
= \frac{y - \mu}{a(\phi)\, V} \cdot \frac{\partial \mu}{\partial \eta} \cdot x_j
= \frac{y - \mu}{\mathrm{Var}(Y)} \cdot \frac{\partial \mu}{\partial \eta} \cdot x_j
= \frac{y - \mu}{\mathrm{Var}(Y)} \left(\frac{\partial \mu}{\partial \eta}\right)^2 \frac{\partial \eta}{\partial \mu}\, x_j
= (y - \mu)\, W\, \frac{\partial \eta}{\partial \mu}\, x_j,$$

where $W^{-1} = \mathrm{Var}(Y)\,(\partial \eta/\partial \mu)^2$. With $n$ observations,

$$\partial \ell/\partial \beta_j = \sum_{i=1}^n \partial \ell_i/\partial \beta_j,$$

where $W_i^{-1} = \mathrm{Var}(y_i)(\partial \eta_i/\partial \mu_i)^2$ and $\partial \eta_i/\partial \mu_i$ means $\partial \eta/\partial \mu$ evaluated at the covariate vector $x_i$. The score vector then takes the form $S(\beta) = (\partial \ell/\partial \beta_0, \partial \ell/\partial \beta_1, \ldots, \partial \ell/\partial \beta_{p-1})'$. In vector form we can write $S(\beta)$ as

$$S(\beta) = XW\left[(y - \mu) \circ \partial \eta/\partial \mu\right],$$

where $y = (y_1, \ldots, y_n)'$ and $\mu = (\mu_1, \ldots, \mu_n)'$ are $n \times 1$ vectors, $X = (x_1, \ldots, x_n)$ is a $p \times n$ matrix, $W$ denotes the diagonal matrix $W = \mathrm{diag}(W_1, W_2, \ldots, W_n)$, and $\circ$ denotes an elementwise product.

Newton-Raphson and Fisher Scoring

Newton-Raphson:

$$\hat{\beta}^{(r+1)} = \hat{\beta}^{(r)} + I^{-1}(\hat{\beta}^{(r)})\, S(\hat{\beta}^{(r)}),$$

where $I$ is the observed information matrix.

Fisher Scoring Method: Fisher suggested using the expected information matrix rather than the observed information matrix. In general this simplifies the computations, as we shall see. Consider, for a single observation (one observation of many, but with the subscript omitted for convenience):

$$I_{jk} = -\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} = -\frac{\partial}{\partial \beta_k}\left\{(y - \mu)\, W \left(\frac{\partial \eta}{\partial \mu}\right) x_j\right\}
= -(y - \mu) \frac{\partial}{\partial \beta_k}\left\{W \left(\frac{\partial \eta}{\partial \mu}\right) x_j\right\} + \frac{\partial \mu}{\partial \beta_k}\, W \left(\frac{\partial \eta}{\partial \mu}\right) x_j.$$

But

$$\frac{\partial \mu}{\partial \beta_k} = \frac{\partial \mu}{\partial \eta} \cdot \frac{\partial \eta}{\partial \beta_k} = \frac{\partial \mu}{\partial \eta}\, x_k.$$

Therefore,

$$I_{jk} = -(y - \mu) \frac{\partial}{\partial \beta_k}\left\{W \left(\frac{\partial \eta}{\partial \mu}\right) x_j\right\} + x_j W x_k.$$

Taking expectations we get

$$E(I_{jk}) = E\left(-\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k}\right)
= -\frac{\partial}{\partial \beta_k}\left\{W \left(\frac{\partial \eta}{\partial \mu}\right) x_j\right\} E(y - \mu) + x_j W x_k.$$

Notice that the first term vanishes since $E(y - \mu) = 0$ by definition. Then, for $n$ observations we can write the expected information as

$$I_{jk} = \sum_{i=1}^n x_{ij} W_i x_{ik} = (XWX')_{jk},$$

where again $W$ is the diagonal matrix $W = \mathrm{diag}(W_1, W_2, \ldots, W_n)$ and $W_i^{-1} = \mathrm{Var}(Y_i)(\partial \eta_i/\partial \mu_i)^2$. The Fisher scoring method operates by utilizing

$$\hat{\beta}^{(r+1)} = \hat{\beta}^{(r)} + I^{-1}(\hat{\beta}^{(r)})\, S(\hat{\beta}^{(r)}),$$

where $I$ is the expected information matrix given above, as opposed to the observed information matrix which is used in the Newton-Raphson algorithm.

1.3 Iteratively Re-weighted Least Squares

Overview

Why is this called iteratively re-weighted least squares? It is because of the following manipulation:

$$\hat{\beta}^{(r+1)} = \hat{\beta}^{(r)} + I^{-1}(\hat{\beta}^{(r)}) S(\hat{\beta}^{(r)})$$
$$I(\hat{\beta}^{(r)})\hat{\beta}^{(r+1)} = I(\hat{\beta}^{(r)})\hat{\beta}^{(r)} + S(\hat{\beta}^{(r)})$$
$$(XW(\hat{\beta}^{(r)})X')\hat{\beta}^{(r+1)} = (XW(\hat{\beta}^{(r)})X')\hat{\beta}^{(r)} + XW(\hat{\beta}^{(r)})\left[(y - \mu(\hat{\beta}^{(r)})) \circ (\partial \eta(\hat{\beta}^{(r)})/\partial \mu)\right]$$
$$(XW(\hat{\beta}^{(r)})X')\hat{\beta}^{(r+1)} = XW(\hat{\beta}^{(r)})\left(X'\hat{\beta}^{(r)} + (y - \mu(\hat{\beta}^{(r)})) \circ (\partial \eta(\hat{\beta}^{(r)})/\partial \mu)\right)$$

Let $z = \eta + (y - \mu) \circ \partial \eta/\partial \mu$. Then

$$\hat{\beta}^{(r+1)} = (XW(\hat{\beta}^{(r)})X')^{-1} XW(\hat{\beta}^{(r)})\, z(\hat{\beta}^{(r)}).$$

This is the same as the weighted LS estimate of $\beta$ with dependent variable $z(\hat{\beta}^{(r)})$ and weight matrix $W(\hat{\beta}^{(r)})$. Since we are updating $z$ and $W$ with each iteration, it is called re-weighted least squares, and since we have to repeat this estimation procedure until convergence, it is called iteratively re-weighted least squares.

Note:

1. Consider a Taylor series approximation of $g(y)$ about $\mu$,

$$g(y) = g(\mu) + g'(\mu)(y - \mu) + g''(\mu)(y - \mu)^2/2 + \cdots.$$

Since $g(\mu) = \eta$ and $g'(\mu) = \partial \eta/\partial \mu$, the working response $z = \eta + (y - \mu)\,\partial \eta/\partial \mu$ can be thought of as a linearized form of the link function. That is, it provides a linear approximation to the functional relationship between the mean of the distribution and the linear predictor.

2. This motivates the choice of $W$:

$$\mathrm{Var}(Z) = \left(\frac{\partial \eta}{\partial \mu}\right)^2 \mathrm{Var}(Y) = a(\phi)\, b''(\theta) \left(\frac{\partial \eta}{\partial \mu}\right)^2,$$

so the inverse variance is

$$W = (\mathrm{Var}(Z))^{-1} = \left(\frac{\partial \mu}{\partial \eta}\right)^2 (\mathrm{Var}(Y))^{-1} = \frac{1}{a(\phi)\, b''(\theta)} \left(\frac{\partial \mu}{\partial \eta}\right)^2.$$
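
The update above is short enough to implement directly. The following R sketch (the function and variable names are our own, and $X$ is stored in the usual row-per-observation form, so the formulas use its transpose) carries out the IRLS iteration for binomial responses with the logit link, for which $\partial\eta/\partial\mu = 1/\{m_i\pi_i(1-\pi_i)\}$, so that $W_i = m_i\pi_i(1-\pi_i)$ and $z_i = \eta_i + (y_i - m_i\pi_i)/W_i$.

irls_logit <- function(X, y, m, maxit = 25, tol = 1e-10) {
  beta <- rep(0, ncol(X))
  for (r in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    pi  <- 1 / (1 + exp(-eta))     # inverse of the logit link
    W   <- m * pi * (1 - pi)       # W_i = m_i pi_i (1 - pi_i); here also Var(Y_i)
    z   <- eta + (y - m * pi) / W  # working response z = eta + (y - mu) deta/dmu
    beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) return(drop(beta_new))
    beta <- beta_new
  }
  drop(beta)
}

On binomial data such as the prenatal care example of Chapter 2, with X = cbind(1, clinic, loc), this converges in a handful of iterations to the same estimates reported by the glm function.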

When is the Fisher Scoring Method Equivalent to Newton-Raphson?

This question can be equivalently re-phrased as: when is the expected information matrix the same as the observed information matrix? Recall

$$I_{jk} = -\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} = -(y - \mu)\frac{\partial}{\partial \beta_k}\left\{W \left(\frac{\partial \eta}{\partial \mu}\right) x_j\right\} + x_j W x_k.$$

Consider the first term of the above expression for the observed information matrix. Recall that

$$V = b''(\theta) = \partial b'(\theta)/\partial \theta = \partial \mu/\partial \theta.$$

Then

$$W = \frac{1}{a(\phi)\, V}\left(\frac{\partial \mu}{\partial \eta}\right)^2
= \frac{1}{a(\phi)} \cdot \frac{1}{\partial \mu/\partial \theta}\left(\frac{\partial \mu}{\partial \eta}\right)\left(\frac{\partial \mu}{\partial \eta}\right)
= \frac{1}{a(\phi)} \cdot \frac{\partial \theta}{\partial \mu}\left(\frac{\partial \mu}{\partial \eta}\right)\left(\frac{\partial \mu}{\partial \eta}\right)
= \frac{1}{a(\phi)}\left(\frac{\partial \mu}{\partial \eta}\right) \quad \text{with the canonical link } \theta = \eta.$$

Therefore, with the canonical link,

$$W\left(\frac{\partial \eta}{\partial \mu}\right) x_j = x_j/a(\phi),$$

and since

$$\frac{\partial}{\partial \beta_k}\left\{x_j/a(\phi)\right\} = 0,$$

the expected information matrix equals the observed information matrix. Hence, there is no difference between the Newton-Raphson algorithm and the Fisher scoring algorithm. The difference arises when using other (non-canonical) link functions!

Questions

Problem 1.1.

Suppose a population is divided into $n$ strata, and we sample $m_i$ subjects from the $i$th stratum, $i = 1, \ldots, n$. Let $Y_{ij} \sim N(\mu_i, \phi)$ (independently distributed) denote the response for the $j$th subject in the $i$th stratum of the sample, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. Suppose all that is available are the sample means for each stratum, $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_n$, and an associated $p \times 1$ covariate vector $x_i = (x_{i0}, x_{i1}, \ldots, x_{i,p-1})'$, where $x_{i0} = 1$, $i = 1, \ldots, n$. Then

$$f(\bar{y}_i; \mu_i, \phi) = \sqrt{\frac{m_i}{2\pi\phi}} \exp\{-m_i(\bar{y}_i - \mu_i)^2/(2\phi)\},$$

where $-\infty < \bar{y}_i < \infty$, $-\infty < \mu_i < \infty$, and $\phi > 0$.

a. Show that the distribution of $\bar{Y}_i$ belongs to the exponential family and find the functions $a_i(\cdot)$, $b(\cdot)$, $c(\cdot\,;\cdot)$, $E(\bar{Y}_i)$, $\mathrm{Var}(\bar{Y}_i)$, and the canonical link function $g(\mu) = \eta$.

b. Given the data $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_n$ and the linear predictor $\eta_i = x_i'\beta$, find the specific form of the score vector and information matrix for $\beta$ and explain how you would obtain maximum likelihood estimates of $\beta_0, \beta_1, \ldots, \beta_{p-1}$.

c. Relate the Newton-Raphson algorithm to any other method of model fitting you may have seen before.

Problem 1.2.

Consider a setting where $Y_{i1}, \ldots, Y_{im_i}$ are $m_i$ independently distributed Poisson random variables with $Y_{ij} \sim \mathrm{Poisson}(\mu_{ij})$, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. Moreover, assume that the Poisson counts are generated by a time-homogeneous Poisson process with $\mu_{ij} = \lambda_i t_{ij}$, where $\lambda_i$ is an underlying rate assumed to be common to $Y_{i1}, \ldots, Y_{im_i}$ and $t_{ij}$ is the duration of observation leading to the count $y_{ij}$, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. Finally, assume that associated with $Y_{i1}, \ldots, Y_{im_i}$ is a $p \times 1$ covariate vector $x_i = (x_{i0}, x_{i1}, \ldots, x_{i,p-1})'$, where $x_{i0} = 1$, $i = 1, \ldots, n$, and let $\beta = (\beta_0, \ldots, \beta_{p-1})'$.

a. Write down the likelihood for the rates $\lambda = (\lambda_1, \ldots, \lambda_n)'$.

b. Show that the distribution of $Y_i = \sum_{j=1}^{m_i} Y_{ij}$ belongs to the exponential family and hence find the functions $a(\cdot)$, $b(\cdot)$, $c(\cdot\,;\cdot)$, $E(Y_i)$, $\mathrm{Var}(Y_i)$, and the canonical link function $g(\mu) = \eta$.

c. Given summary data $y_1, y_2, \ldots, y_n$ and the linear predictors $\eta_i = x_i'\beta$, find the specific form of an entry of the score vector (i.e. $\partial \ell/\partial \beta_j$) and the expected information matrix (i.e. $-E(\partial^2 \ell/\partial \beta_j \partial \beta_k)$) under the canonical link.

d. Briefly describe how to obtain maximum likelihood estimates of $\beta_0, \beta_1, \ldots, \beta_{p-1}$ using a Fisher scoring algorithm.

Problem 1.3.

Let $Y_1, Y_2, \ldots, Y_n$ be independent Poisson random variables with means $\mu_1, \mu_2, \ldots, \mu_n$ respectively, and let $Y = (Y_1, \ldots, Y_n)'$. Associated with each $Y_i$ is a covariate vector $x_i = (1, x_{i1}, \ldots, x_{i,p-1})'$ of length $p$.

a. Show that $\eta_i = \log \mu_i$ is the natural parameter of the Poisson distribution.

b. Find the score vector for $\beta$.

c. Find the observed and expected information matrix for $\beta$ and hence show how to obtain the MLE for $\beta$.

d. FOR STAT 831 ONLY: Show that $T = XY$ is a vector of sufficient statistics for $\beta$, where $X$ is the $p \times n$ matrix with columns comprised of the covariate vectors.

Problem 1.4.

Suppose you observe a sample $y_1, y_2, \ldots, y_n$ consisting of realizations of $n$ independent Poisson random variables where $E(Y_i) = \mu_i$. Suppose that associated with $y_i$ is a $p \times 1$ vector of explanatory variables $(1, x_{i1}, x_{i2}, \ldots, x_{i,p-1})'$. A Poisson regression model with the canonical link takes the form

$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}.$$

The likelihood for the vector of regression coefficients $\beta$ is constructed by making the substitution above into the likelihood for the $n$ means $\mu_1, \ldots, \mu_n$. To answer the following questions, you may either calculate the derivatives using standard methods, or use the general results given in class for the exponential family.

a. Write down the score vector for the regression coefficients $\beta_i$, $i = 0, \ldots, p-1$.

b. Write down the observed and expected information matrix for the regression coefficients. Are they the same or different? Why?

c. What is the form of the weight function? What types of observations will have the largest and smallest weights?

Problem 1.5.

Suppose $y_{i1}, y_{i2}, \ldots, y_{im_i}$ are observations from a Gaussian distribution with mean $\mu_i$ and variance $\sigma^2$, $i = 1, 2, \ldots, I$. Associated with each $y_i = (y_{i1}, y_{i2}, \ldots, y_{im_i})'$ is a vector of explanatory variables $x_i = (x_{i0}, x_{i1}, \ldots, x_{i,p-1})'$.

a. Show that $\bar{y}_i = \sum_{j=1}^{m_i} y_{ij}/m_i$ is sufficient for $\mu_i$.

b. Show that the distribution of $\bar{y}_i$ is in the exponential family and identify the parameters $\theta_i$ and $\phi$, and the functions $a_i(\phi)$, $b(\theta_i)$, and $c(y_i; \phi)$.

c. If we want to set up a regression model, what is the canonical link?

d. Write down the score and information function and indicate connections between the Fisher scoring/Newton-Raphson iterations and another method for estimating regression parameters that you've encountered before.

Problem 1.6.

Consider the table below, which summarizes data from two samples with independent binomial responses for each.

                     Outcome
          Present      Absent             Total
Group 1   y            m_1 - y            m_1
Group 2   t - y        m_2 - t + y        m_2
Total     t            m_. - t            m_.

The conditional distribution of $y$ given $t$, the first column total (and $m_1$ and $m_2$), is

$$f(y \mid t, m_1, m_2) = \frac{\binom{m_1}{y}\binom{m_2}{t - y} \exp\{y\alpha\}}{\sum_{v \in S} \binom{m_1}{v}\binom{m_2}{t - v} \exp\{v\alpha\}},$$

where $S = \{v : \max(0, t - m_2) \le v \le \min(m_1, t)\}$.

a. Show that this distribution belongs to the exponential family and hence find the canonical parameter, the functions $a(\cdot)$, $b(\cdot)$, $c(\cdot\,;\cdot)$, $E(Y)$, $\mathrm{Var}(Y)$ and the canonical link function.

b. Suppose now that we have a series of $n$ independent $2 \times 2$ tables of the sort above and let $x_i = (1, x_{i1}, \ldots, x_{i,p-1})'$ denote a $p \times 1$ vector of explanatory variables for the $i$th table. Introducing subscripts to distinguish data from different tables, we then summarize the data from table $i$ as

                       Outcome
          Present        Absent                 Total
Group 1   y_i            m_{1i} - y_i           m_{1i}
Group 2   t_i - y_i      m_{2i} - t_i + y_i     m_{2i}
Total     t_i            m_{.i} - t_i           m_{.i}

The conditional distribution of $y_i$ given $t_i$, the first column total (and $m_{1i}$ and $m_{2i}$), is

$$f(y_i \mid t_i, m_{1i}, m_{2i}) = \frac{\binom{m_{1i}}{y_i}\binom{m_{2i}}{t_i - y_i} \exp\{y_i \alpha_i\}}{\sum_{v \in S_i} \binom{m_{1i}}{v}\binom{m_{2i}}{t_i - v} \exp\{v\alpha_i\}},$$

with $S_i = \{v : \max(0, t_i - m_{2i}) \le v \le \min(m_{1i}, t_i)\}$. Given the data $y_1, y_2, \ldots, y_n$ and the linear predictor $\eta_i = x_i'\beta$ where $\beta = (\beta_0, \ldots, \beta_{p-1})'$, find the specific form of the score and information function and explain how you would obtain maximum likelihood estimates of $\beta_0, \beta_1, \ldots, \beta_{p-1}$.

Problem 1.7.

Consider $n$ $2 \times 2$ tables where the $i$th table is given by

          Success        Failure                Total
Group 1   y_i            m_{i1} - y_i           m_{i1}
Group 2   t_i - y_i      m_{i2} - t_i + y_i     m_{i2}
Total     t_i            m_i - t_i              m_i

a. Derive the conditional distribution of $Y_i$ given $m_{i1}$, $m_{i2}$ and $t_i$, and show that this belongs to the exponential family of distributions. Find the canonical parameter and hence find the conditional mean and variance of $Y_i$.

b. Suppose that associated with the $i$th table is a $p \times 1$ covariate vector $x_i = (1, x_{i1}, \ldots, x_{i,p-1})'$, $i = 1, \ldots, n$. Use the canonical link and the systematic component $\eta = X'\beta$, where $X$ is the $p \times n$ matrix with column vectors $x_1, \ldots, x_n$. Find the score vector and the observed and expected information matrix for $\beta$, and hence show how to obtain the MLE of $\beta$.

Problem 1.8.

Suppose a population is divided into $n$ strata, and we sample $m_i$ subjects from the $i$th stratum, $i = 1, \ldots, n$. Let $Y_{ij} \sim \mathrm{Bernoulli}(\pi_i)$ (independently distributed) denote the response for the $j$th subject in the $i$th stratum of the sample, so that $P(Y_{ij} = 1) = \pi_i$ and $P(Y_{ij} = 0) = 1 - \pi_i$, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. Suppose that the responses available from the strata are simply the means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_n$ (where $\bar{y}_i = \sum_{j=1}^{m_i} y_{ij}/m_i$, $i = 1, \ldots, n$), and that the associated sample sizes, $m_1, \ldots, m_n$, are also available.

a. Show that the distribution of $\bar{y}_i$ belongs to the exponential family by identifying the functions $a(\cdot)$, $b(\cdot)$ and $c(\cdot\,;\cdot)$, obtain $E(\bar{Y}_i)$ and $\mathrm{Var}(\bar{Y}_i)$, and name the canonical link function.

b. Suppose that associated with $\bar{y}_i$ is a $p \times 1$ covariate vector $x_i = (1, x_{i1}, \ldots, x_{i,p-1})'$, $i = 1, \ldots, n$. If $\beta = (\beta_0, \ldots, \beta_{p-1})'$ is a $p \times 1$ vector of regression coefficients, let $\eta_i = x_i'\beta$ denote the linear predictor for stratum $i$, $i = 1, \ldots, n$. Find the specific form of the score vector and information matrix and explain how you would obtain the maximum likelihood estimate of $\beta$, which we denote as $\hat{\beta}$.

c. What are the mean vector and covariance matrix for the asymptotic distribution of the MLE $\hat{\beta}$?

d. STAT 831 ONLY: Is there any information lost about the vector of regression coefficients $\beta$ when only the sample means and the sample sizes are available from each stratum (as opposed to when the individual subjects' responses are available)? Explain.

2 Basic Methods for the Analysis of Binary Data

2.1 Introduction

Binary responses generally require different analysis techniques than those considered so far in regression courses. Examples of binary responses include disease status (diseased/not diseased) and survival status (dead/alive). In addition to having such a binary response, we often have a single binary covariate of interest. Examples include treatment (experimental/control) or exposure (exposed to radiation/not exposed to radiation) variables. To summarize the data, we might construct a $2 \times 2$ table as follows.

Table 2.1. A $2 \times 2$ Table

                     Disease
          Present      Absent           Total
Group 1   y_1          m_1 - y_1        m_1
Group 2   y_2          m_2 - y_2        m_2
Total     y_.          m_. - y_.        m_.

If we have $m_1$ and $m_2$ fixed, we typically assume we have two independent binomial samples with $Y_k \sim \mathrm{Bin}(m_k, \pi_k)$, $k = 1, 2$.

There are many measures of association one can consider for such tables, but here we will focus on the odds ratio. The odds of one event versus another is simply the ratio of their respective probabilities. Therefore, the odds of disease versus no disease in Group 1 is $\pi_1/(1 - \pi_1)$; we sometimes just refer to this as the odds of disease in Group 1. In this context, the odds is a 1-1 monotonically increasing function of $\pi_1$ which takes values on the non-negative real line. The odds of disease in Group 2 is $\pi_2/(1 - \pi_2)$. The odds ratio reflecting the relative odds of disease in Group 1 versus Group 2 is then

$$\psi = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}.$$

Note that in the case of a rare disease (i.e. when $\pi_1$ and $\pi_2$ are very small), $\psi$ is close to the relative risk, $\pi_1/\pi_2$. This can be seen by noting that

$$\psi = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)} = \frac{\pi_1}{\pi_2}\left(\frac{1 - \pi_2}{1 - \pi_1}\right).$$

When $\pi_1$ and $\pi_2$ are both small, the fraction in parentheses is close to 1.
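
As a quick numerical illustration (the probabilities are chosen arbitrarily), suppose $\pi_1 = 0.010$ and $\pi_2 = 0.005$; the R lines below show that the odds ratio and the relative risk nearly coincide.

p1 <- 0.010; p2 <- 0.005
(p1 / (1 - p1)) / (p2 / (1 - p2))   # odds ratio: about 2.01
p1 / p2                             # relative risk: 2.00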

2.2 Estimation of the Odds Ratio

We would like to use likelihood theory to estimate $\psi$ and therefore need to construct an appropriate likelihood function. Note that

$$\Pr(Y_1 = y_1, Y_2 = y_2) = \binom{m_1}{y_1} \pi_1^{y_1}(1 - \pi_1)^{m_1 - y_1} \binom{m_2}{y_2} \pi_2^{y_2}(1 - \pi_2)^{m_2 - y_2},$$

so

$$L(\pi_1, \pi_2) = \pi_1^{y_1}(1 - \pi_1)^{m_1 - y_1}\, \pi_2^{y_2}(1 - \pi_2)^{m_2 - y_2}
= \left(\frac{\pi_1}{1 - \pi_1}\right)^{y_1} (1 - \pi_1)^{m_1} \left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2} (1 - \pi_2)^{m_2}
= \left(\frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}\right)^{y_1} \left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2 + y_1} (1 - \pi_1)^{m_1} (1 - \pi_2)^{m_2}.$$

We want to reparameterize to get rid of $\pi_1$, and so note that if $\psi = [\pi_1/(1 - \pi_1)]/[\pi_2/(1 - \pi_2)]$ then $\pi_1 = \psi\pi_2/[1 - \pi_2 + \psi\pi_2]$. Substituting into the above likelihood we get

$$L(\psi, \pi_2) = \psi^{y_1} \left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2 + y_1} \left[\frac{1 - \pi_2}{1 - \pi_2 + \psi\pi_2}\right]^{m_1} (1 - \pi_2)^{m_2}.$$

Now that we have a likelihood involving the parameter of interest, we can consider further reparameterization to enable us to obtain Wald-type quantities for inference. Wald-type quantities are most appealing when the corresponding parameters are unrestricted (i.e. the parameter space is the real line). Therefore consider reparameterizing to $\beta = \log \psi$ and $\alpha = \log(\pi_2/(1 - \pi_2))$. Here we get

$$L(\alpha, \beta) = e^{y_1\beta} (1 + e^{\alpha + \beta})^{-m_1}\, e^{y_.\alpha} (1 + e^{\alpha})^{-m_2},$$

where $y_. = y_1 + y_2$. Then the log-likelihood is

$$\ell(\alpha, \beta) = y_1\beta - m_1 \log(1 + e^{\alpha + \beta}) + y_.\alpha - m_2 \log(1 + e^{\alpha}).$$

Differentiating with respect to $\alpha$ and $\beta$ we get

$$S_\alpha(\alpha, \beta) = \frac{\partial \ell(\alpha, \beta)}{\partial \alpha} = y_. - \frac{m_1 e^{\alpha + \beta}}{1 + e^{\alpha + \beta}} - \frac{m_2 e^{\alpha}}{1 + e^{\alpha}}$$

$$S_\beta(\alpha, \beta) = \frac{\partial \ell(\alpha, \beta)}{\partial \beta} = y_1 - \frac{m_1 e^{\alpha + \beta}}{1 + e^{\alpha + \beta}}$$

If $S(\alpha, \beta) = (S_\alpha(\alpha, \beta), S_\beta(\alpha, \beta))'$, solving $S(\alpha, \beta) = 0$ gives

$$\hat{\alpha} = \log\left(\frac{y_2}{m_2 - y_2}\right) \qquad \text{and} \qquad \hat{\beta} = \log\left(\frac{y_1/(m_1 - y_1)}{y_2/(m_2 - y_2)}\right).$$

These estimates are natural since they imply $\hat{\pi}_2 = e^{\hat{\alpha}}/(1 + e^{\hat{\alpha}}) = y_2/m_2$, $\hat{\pi}_1 = y_1/m_1$ and $\hat{\psi} = [\hat{\pi}_1/(1 - \hat{\pi}_1)]/[\hat{\pi}_2/(1 - \hat{\pi}_2)]$.

Note that

$$I_{\alpha\alpha} = m_1\frac{(1 + e^{\alpha+\beta})e^{\alpha+\beta} - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} + m_2\frac{(1 + e^{\alpha})e^{\alpha} - e^{2\alpha}}{(1 + e^{\alpha})^2}
= \frac{m_1 e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} + \frac{m_2 e^{\alpha}}{(1 + e^{\alpha})^2},$$

$$I_{\alpha\beta} = m_1\frac{(1 + e^{\alpha+\beta})e^{\alpha+\beta} - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} = \frac{m_1 e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2},$$

$$I_{\beta\beta} = m_1\frac{(1 + e^{\alpha+\beta})e^{\alpha+\beta} - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} = \frac{m_1 e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2}.$$

We are interested in the $(\beta, \beta)$ entry of $I^{-1}$, which we denote by $I^{\beta\beta}$ as in Section 1.2. This is given by

$$I^{\beta\beta}(\alpha, \beta) = [I_{\beta\beta} - I_{\beta\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta}]^{-1},$$

and here we obtain

$$I^{\beta\beta}(\alpha, \beta) = \frac{1}{E(Y_1)} + \frac{1}{E(m_1 - Y_1)} + \frac{1}{E(Y_2)} + \frac{1}{E(m_2 - Y_2)},$$

which we evaluate at $(\hat{\alpha}, \hat{\beta})$ to give

$$I^{\beta\beta}(\hat{\alpha}, \hat{\beta}) = \frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}.$$

Proof: First note that we can write $I_{\alpha\alpha} = m_1\pi_1(1 - \pi_1) + m_2\pi_2(1 - \pi_2)$, $I_{\alpha\beta} = I_{\beta\alpha} = m_1\pi_1(1 - \pi_1)$ and $I_{\beta\beta} = m_1\pi_1(1 - \pi_1)$. Then

$$[I^{-1}]_{\beta\beta} = I^{\beta\beta} = [I_{\beta\beta} - I_{\beta\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta}]^{-1}
= \left[m_1\pi_1(1 - \pi_1) - \frac{(m_1\pi_1(1 - \pi_1))^2}{m_1\pi_1(1 - \pi_1) + m_2\pi_2(1 - \pi_2)}\right]^{-1}$$
$$= \left[\frac{m_1\pi_1(1 - \pi_1)\, m_2\pi_2(1 - \pi_2)}{m_1\pi_1(1 - \pi_1) + m_2\pi_2(1 - \pi_2)}\right]^{-1}
= \frac{1}{m_1\pi_1(1 - \pi_1)} + \frac{1}{m_2\pi_2(1 - \pi_2)}$$
$$= \frac{1}{m_1\pi_1} + \frac{1}{m_1(1 - \pi_1)} + \frac{1}{m_2\pi_2} + \frac{1}{m_2(1 - \pi_2)}
= \frac{1}{E(Y_1)} + \frac{1}{E(m_1 - Y_1)} + \frac{1}{E(Y_2)} + \frac{1}{E(m_2 - Y_2)}.$$

Given this result we may obtain a Wald-type approximate 95% CI for $\beta$ as

$$\left(\hat{\beta} - 1.96\left[\frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}\right]^{1/2},\ \hat{\beta} + 1.96\left[\frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}\right]^{1/2}\right),$$

which we denote as $(\hat{\beta}_L, \hat{\beta}_U)$. An approximate 95% CI for $\psi$ is then given by $(e^{\hat{\beta}_L}, e^{\hat{\beta}_U})$. We are not typically that interested in $\alpha$ or $\pi_2$ in such $2 \times 2$ tables.
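
The point estimate and Wald interval above are simple enough to wrap in a few lines of R; in this small helper (the function name or_wald and its arguments are our own), the inputs are the entries of a table laid out as in Table 2.1.

or_wald <- function(y1, m1, y2, m2) {
  beta.hat <- log((y1 / (m1 - y1)) / (y2 / (m2 - y2)))      # log odds ratio
  se <- sqrt(1/y1 + 1/(m1 - y1) + 1/y2 + 1/(m2 - y2))       # Wald standard error
  exp(c(psi.hat = beta.hat,
        lower = beta.hat - 1.96 * se,
        upper = beta.hat + 1.96 * se))
}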

Note that, given the likelihood and the asymptotic (large sample) results in Section 1.2, we could use the likelihood itself to conduct inference about $\beta$ and hence $\psi$. The Wald-type pivotal used here is much more convenient, and since the range of values for $\beta$ is unrestricted the results will generally agree very closely.

2.3 Multiple Regression for Binary Responses

The results of the preceding section were directed at the case with a single factor variable with two levels and a binary response. This is a simple setting, but more often we need multiple regression methodology since we may

a. want to be able to control for confounding variables and hence want to examine the effect of several (possibly related, collinear) variables simultaneously,

b. want to examine the effect of categorical covariates (more than 2 levels) or continuous covariates,

c. want to develop sophisticated models that describe complex relationships.

Example: Consider the data in the following table, which describes the relationship between the level of prenatal care and fetal mortality. The data arose from two clinics, which we refer to as Clinic A and Clinic B (not their real names!).

Table 2.2. Prenatal Care Data from Two Clinics

            Died   Survived   Total
Intensive    20      316       336
Regular      46      373       419

Here we obtain $\hat{\psi} = [\hat{\pi}_1/(1 - \hat{\pi}_1)]/[\hat{\pi}_2/(1 - \hat{\pi}_2)] = (20/316)/(46/373) = 0.51$ and a 95% CI for $\psi$ of (0.30, 0.89). This suggests a strong association between level of prenatal care and fetal mortality.

However, if we consider data from just those subjects who are at Clinic A, we get the following table.

Table 2.3. Prenatal Care Data from Patients at Clinic A

            Died   Survived
Intensive    16      293
Regular      12      176

This gives $\hat{\psi} = (16 \times 176)/(12 \times 293) = 0.80$ with a 95% CI for $\psi$ of (0.37, 1.73). While the odds ratio estimate is in the direction of a protective effect with intensive prenatal care, the confidence interval is quite wide and includes values above one (which correspond to an increased risk of mortality).

Now we consider the corresponding data from Clinic B.

Table 2.4. Prenatal Care Data from Patients at Clinic B

            Died   Survived
Intensive     4       23
Regular      34      197

This gives $\hat{\psi} = (4 \times 197)/(34 \times 23) = 1.01$ with a 95% CI for $\psi$ of (0.33, 3.10). Note that the reduction in the odds of fetal mortality due to intensive prenatal care, which we observed in the pooled data, appears to have vanished for patients in Clinic B, and we found above that the estimate of benefit is considerably smaller (and no longer significant) among patients in Clinic A.

For further investigation, we examine the relationship between clinic and level of care.

Table 2.5. The Association Between Clinic and Level of Care

            Clinic A   Clinic B
Intensive     309         27
Regular       188        231

Here we get $\hat{\psi} = (309 \times 231)/(188 \times 27) = 14.06$ with a 95% CI for $\psi$ of (9.12, 21.76). This suggests a very strong and statistically significant relationship between clinic and the intensity of prenatal care. Specifically, we can see that the proportion of patients in Clinic A receiving intensive prenatal care is considerably higher than it is for patients in Clinic B.

The following table displays the relationship between clinic and mortality.

Table 2.6. The Association Between Clinic and Mortality Rate

            Clinic A   Clinic B
Died           28         38
Survived      469        220

Here we obtain $\hat{\psi} = 0.35$ with a 95% CI of (0.21, 0.58). This suggests that there is a statistically significantly (significant at the 5% level, since the 95% confidence interval does not include one) higher rate of mortality in Clinic B. Finally, we can tabulate the relationship between clinic and mortality stratified by level of care. We do that in Table 2.7 below, and we will return to this example shortly.

To summarize what we found here: there is an apparent strong association between level of prenatal care and fetal mortality. When stratifying by clinic, evidence of this apparent association is greatly reduced. When we stratify by level of prenatal care, there is a reduced risk of mortality for patients in Clinic A versus Clinic B. We aim to study how these findings might be reflected in a regression model.
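
The pooled and stratified odds ratios above can be reproduced with the or_wald() helper sketched in Section 2.2, with the counts taken from Tables 2.2-2.4:

or_wald(20, 336, 46, 419)   # pooled:   0.51 (0.30, 0.89)
or_wald(16, 309, 12, 188)   # Clinic A: 0.80 (0.37, 1.73)
or_wald( 4,  27, 34, 231)   # Clinic B: 1.01 (0.33, 3.10)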

Table 2.7. Stratified Tabulation of Clinic Effects

               Intensive           Regular
           Died   Survived     Died   Survived   Total
Clinic A    16      293         12      176       497
Clinic B     4       23         34      197       258

2.4 Setting Up a Binomial Regression Model

Introduction and Notation

Let $x_1, x_2, \ldots, x_{p-1}$ be a set of $p - 1$ explanatory variables and $x = (1, x_1, x_2, \ldots, x_{p-1})'$ be a $p \times 1$ vector of explanatory variables. Let $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})'$ be a $p \times 1$ vector of parameters. The scalar quantity

$$\eta = x'\beta = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$$

is called the linear predictor. Let $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{i,p-1})'$ be the vector of covariates for the $i$th subject, $i = 1, 2, \ldots, n$. Define the $p \times n$ matrix

$$X = (x_1, x_2, \ldots, x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & & \vdots \\ x_{1,p-1} & x_{2,p-1} & \cdots & x_{n,p-1} \end{pmatrix}.$$

Then the vector of linear predictors is given by

$$\eta = (\eta_1, \ldots, \eta_n)' = X'\beta.$$

Recall that in the context of the Gaussian linear model the $Y_i \sim N(\mu_i, \sigma^2)$ are independent and we set

$$E(Y_i) = \mu_i = \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1}.$$

Now consider binomial data with $Y_i \sim \mathrm{Bin}(m_i, \pi_i)$. We might think of a regression model of the form

$$E(Y_i/m_i) = \pi_i = \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}.$$

Is this a reasonable model? It is not particularly convenient to work with, since we have to impose constraints on the right-hand side because $0 \le \pi_i \le 1$, $i = 1, 2, \ldots, n$. Therefore, rather than working with $\pi_i$ directly, we work with a function of it. The so-called link function defines such a transformation, which typically maps $[0, 1] \to (-\infty, +\infty)$. We denote the link function by $g(\pi)$.

Name of Link Function    Expression
Identity                 $g(\pi) = \pi$
Log-log                  $g(\pi) = \log(-\log(\pi))$
Probit                   $g(\pi) = \Phi^{-1}(\pi)$
Logit                    $g(\pi) = \log(\pi/(1 - \pi))$

Here $\Phi$ is the c.d.f. of a standard normal random variable. Having selected the logit link, our regression model takes the form $g(\pi) = \eta$. Introducing the subscript $i$ to distinguish individuals, we have

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i'\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}.$$

The Logit Link and Odds Ratios

Let $Y_1$ and $Y_2$ denote the number of individuals with the outcome in Groups 1 and 2 respectively, with $Y_k \sim \mathrm{Bin}(m_k, \pi_k)$, $k = 1, 2$. Let $x_i = 1$ if the $i$th individual is in Group 1 and $x_i = 0$ otherwise. We now consider a model for each individual's response (which is binary, not binomial) of the following form:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i.$$

The log odds for a subject $i$ in Group 1 is

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1,$$

and the log odds for a subject $j$ in Group 2 is

$$\log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0.$$

This implies that

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_1,$$

which means $\log \psi = \beta_1$, where $\psi$ is the odds ratio comparing the odds of an event for a subject in Group 1 versus a subject in Group 2. Therefore, the regression coefficient from this logistic model may be interpreted as a log odds ratio describing the association between group membership and the outcome.

Frequently we are interested in the parameter $\pi$ itself. In this case note that

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i \quad \Longrightarrow \quad \frac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 x_i} \quad \Longrightarrow \quad \pi_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}.$$

In a Gaussian model, given $\hat{\beta}$, the fitted value for $E(Y_i)$ is $\hat{\mu}_i(x_i) = x_i'\hat{\beta}$. In this binomial regression model, the fitted value for $E(Y_i/m_i)$ is

$$\hat{\pi}_i = \hat{\pi}(x_i) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i1}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i1}}}.$$

More generally, these fitted values may be written as

$$\hat{\pi}_i = \hat{\pi}(x_i) = \frac{\exp(x_i'\hat{\beta})}{1 + \exp(x_i'\hat{\beta})}.$$

Now consider the case with two binary explanatory variables. Let

$$x_{i1} = \begin{cases} 1 & \text{if factor A present} \\ 0 & \text{otherwise} \end{cases} \qquad
x_{i2} = \begin{cases} 1 & \text{if factor B present} \\ 0 & \text{otherwise} \end{cases} \qquad
x_{i3} = \begin{cases} 1 & \text{if A and B present} \\ 0 & \text{otherwise} \end{cases}$$

Consider the model

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

and interpret the effect of $x_{i1}$. First we compute the log odds when factors A and B are both present, where $x_i = (1, 1, 1)'$. Here we get

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 + \beta_2.$$

Then we get the log odds when factor A is absent but B is present by noting that $x_j = (1, 0, 1)'$ and

$$\log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 + \beta_2.$$

Taking the difference of these log odds gives

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \log\left(\frac{\pi_i(1 - \pi_j)}{\pi_j(1 - \pi_i)}\right) = \beta_1.$$

Therefore $\beta_1$ is again the log odds ratio reflecting the effect of factor A, but this time we are controlling for factor B and are specifying that factor B is present. Note that the effect of factor A is the same regardless of the level of B. To see this we examine the log odds when factor A is present and B absent, where $x_i = (1, 1, 0)'$ and

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1,$$

and the log odds when factor A is absent and B is absent, where $x_j = (1, 0, 0)'$ and

$$\log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0.$$

Again the difference gives

$$\log\left(\frac{\pi_i/(1 - \pi_i)}{\pi_j/(1 - \pi_j)}\right) = \beta_1.$$

Now consider the model

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},$$

where we have introduced an interaction term. The log odds when factors A and B are both present is obtained by noting that $x_i = (1, 1, 1, 1)'$, so

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 + \beta_2 + \beta_3.$$

When factor A is absent and B is present, $x_j = (1, 0, 1, 0)'$, giving

$$\log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 + \beta_2.$$

Taking the difference again we find

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \log\left(\frac{\pi_i(1 - \pi_j)}{\pi_j(1 - \pi_i)}\right) = \beta_1 + \beta_3.$$

The log odds when factor A is present and B is absent arises from $x_i = (1, 1, 0, 0)'$, giving

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1,$$

and the log odds when factor A is absent and B is absent arises from $x_j = (1, 0, 0, 0)'$, giving

$$\log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0, \qquad \text{and hence} \qquad \log\left(\frac{\pi_i(1 - \pi_j)}{\pi_j(1 - \pi_i)}\right) = \beta_1.$$

With the interaction term, then, the effect of factor A depends on the presence or absence of factor B. If factor B is absent the log odds ratio relating A to the outcome is $\beta_1$, but if factor B is present it is $\beta_1 + \beta_3$.

Now we consider the data from the prenatal care example from before. Here we set up a regression model for the analyses of interest. Again our response is $Y_i \sim \mathrm{Bin}(m_i, \pi_i)$, $i = 1, 2, \ldots, n$, and we have explanatory variables

$$x_{i1} = \begin{cases} 1 & \text{Clinic A} \\ 0 & \text{Clinic B} \end{cases} \qquad
x_{i2} = \begin{cases} 1 & \text{intensive level of care} \\ 0 & \text{regular level of care} \end{cases} \qquad
x_{i3} = \begin{cases} 1 & \text{intensive level of care and Clinic A} \\ 0 & \text{otherwise} \end{cases}$$

Before considering the data analysis, we note that the parameters of the model can often be interpreted quickly and more easily if we write down the following tables. For a model with

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

we can write

Clinic   Level of Care   x_i           π_i/(1 − π_i)
B        regular         (1, 0, 0)'    e^{β_0}
B        intensive       (1, 0, 1)'    e^{β_0 + β_2}
A        regular         (1, 1, 0)'    e^{β_0 + β_1}
A        intensive       (1, 1, 1)'    e^{β_0 + β_1 + β_2}

where we make use of the fact that $\pi_i/(1 - \pi_i) = \exp(x_i'\beta)$. The last column reports the odds of mortality for the four combinations of the risk factors. If we divide the corresponding terms we can obtain odds ratios. For example, among those patients in Clinic B the relative odds of mortality for those with intensive versus regular care is

$$\frac{e^{\beta_0 + \beta_2}}{e^{\beta_0}} = e^{\beta_2}.$$

Among those in Clinic A the relative odds of mortality for those with intensive versus regular care is

$$\frac{e^{\beta_0 + \beta_1 + \beta_2}}{e^{\beta_0 + \beta_1}} = e^{\beta_2},$$

the same expression we got before for those in Clinic B. By similar methods we can see that the relative odds of mortality for those in Clinic A versus Clinic B is $e^{\beta_1}$, regardless of the level of care they received.

Now consider the model

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},$$

where now $x_i = (1, x_{i1}, x_{i2}, x_{i3})'$ and $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)'$.

Clinic   Level of Care   x_i              π_i/(1 − π_i)
B        regular         (1, 0, 0, 0)'    e^{β_0}
B        intensive       (1, 0, 1, 0)'    e^{β_0 + β_2}
A        regular         (1, 1, 0, 0)'    e^{β_0 + β_1}
A        intensive       (1, 1, 1, 1)'    e^{β_0 + β_1 + β_2 + β_3}

Here the odds ratio of mortality for those with intensive versus regular care among those in Clinic B is

$$\frac{e^{\beta_0 + \beta_2}}{e^{\beta_0}} = e^{\beta_2}.$$

However, the corresponding odds ratio among those in Clinic A is

$$\frac{e^{\beta_0 + \beta_1 + \beta_2 + \beta_3}}{e^{\beta_0 + \beta_1}} = e^{\beta_2 + \beta_3}.$$

If $\beta_3 = 0$ then the effect of level of prenatal care does not depend on the clinic, and vice versa. If $\beta_3 = 0$ and $\beta_2 = 0$ as well, then not only does the effect of level of care not depend on clinic, there is no such effect at all.

Logistic Regression Analysis of the Prenatal Care Data

What follows is the data file prenatal.dat, in which the first line contains the variable labels and the remaining four lines the data. As before we are using indicator variables for the explanatory variables and have binomial response data.

clinic loc  y   m
     1   1 16 309
     1   0 12 188
     0   1  4  27
     0   0 34 231

The program used to analyse the data is given below.

Splus program for analysis of prenatal care data

help.start()
prenatal.dat_read.table("prenatal.dat", header=T)
# here we construct the response variable for the logistic regression analysis
prenatal.dat$resp_cbind(prenatal.dat$y, prenatal.dat$m - prenatal.dat$y)
prenatal.dat
# now we fit the model using the glm function and store the result in "model1";
# we indicate that "resp" contains a binomial response and that we are using the
# logistic link function
model1_glm(resp ~ loc, family=binomial(link=logit), data=prenatal.dat)
summary(model1)
# the "names" function lists the contents of the object "model1" and following
# this statement we examine some of the contents of these objects (try it)
names(model1)
model1$family
model1$formula
model1$coefficients
model1$deviance
model1$fitted.values
model1$residuals
# now we fit a model to examine the relationship between level of care
# and mortality adjusting for clinic
model2_glm(resp ~ clinic + loc, family=binomial(link=logit), data=prenatal.dat)
summary(model2)
# here we examine whether the association between loc and mortality depends on
# the clinic
model3_glm(resp ~ loc + clinic + loc*clinic, family=binomial(link=logit), data=prenatal.dat)
summary(model3)
# now we examine the marginal relationship between mortality and clinic
model4_glm(resp ~ clinic, family=binomial(link=logit), data=prenatal.dat)
summary(model4)

A selection of the output printed by the commands summary(model1), summary(model2), summary(model3) and summary(model4) follows.

> prenatal.dat$resp_cbind(prenatal.dat$y,prenatal.dat$m-prenatal.dat$y)
> prenatal.dat
  clinic loc  y   m resp.1 resp.2
1      1   1 16 309     16    293
2      1   0 12 188     12    176
3      0   1  4  27      4     23
4      0   0 34 231     34    197

Here we print out the augmented data frame to see what the resp variable looks like. Next we fit the regression model examining the relationship between the level of care and mortality. A portion of the output is reported below.

> model1_glm(resp ~ loc, family=binomial(link=logit),data=prenatal.dat)
> summary(model1)

Coefficients:
                Value Std. Error  t value
(Intercept)   -2.0934     0.1563  -13.395
        loc   -0.6671     0.2785   -2.395

Null Deviance: 16.92 on 3 degrees of freedom
Residual Deviance: 10.81 on 2 degrees of freedom
Number of Fisher Scoring Iterations: 3

The numbers under the heading Value are the maximum likelihood estimates of the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_2$ (here we are using the convention that the subscripts of the regression coefficients coincide with the subscripts on the variables themselves). The numbers under Std. Error are estimated standard errors based on the inverse of the information matrix (more on this shortly). Finally, the numbers under t value are Wald-type test statistics for testing the hypothesis $H_0: \beta_k = 0$ vs. $H_A: \beta_k \ne 0$. Note that these are of the form $(\hat{\beta}_k - 0)/\mathrm{s.e.}(\hat{\beta}_k)$. These test statistics are approximately standard normal if the null hypothesis is true. To verify, note that for testing the effect of level of care based on model 1 we find $-0.6671/0.2785 = -2.395$. The p-values can be computed as

$$p\text{-value} = 2\,\Pr(U > |(\hat{\beta}_k - 0)/\mathrm{s.e.}(\hat{\beta}_k)|),$$

where $U \sim N(0, 1)$. Therefore, if we test the hypothesis that there is no relation between level of care and mortality ($H_0: \beta_2 = 0$) we get $2\,\Pr(U > 2.395) = 2(1 - \texttt{pnorm}(2.395)) = 0.017$. Here we conclude that those patients receiving more intensive care are at a significantly lower risk of mortality than those receiving the standard level of care, and that this evidence is rather strong.

To further characterize this dependence we need to conduct inference about the odds ratio. Recall that $\exp(\beta_2)$ is the odds ratio parameter, $\exp(\hat{\beta}_2)$ is the MLE, and $(\exp(\hat{\beta}_2 - 1.96\,\mathrm{s.e.}(\hat{\beta}_2)),\ \exp(\hat{\beta}_2 + 1.96\,\mathrm{s.e.}(\hat{\beta}_2)))$ is an approximate 95% confidence interval. Here we get a point estimate of $\exp(-0.6671) = 0.51$ and a 95% CI of $(\exp(-1.2130), \exp(-0.1211)) = (0.30, 0.89)$, as we did before. In fact the analysis before was exactly the same as this one, and the results will be identical apart from rounding. Finally, note that while the Wald statistic is computed for $\beta_0$, it is seldom of interest. Instead we tend to focus on coefficients of explanatory variables, which have useful interpretations.

Now we consider introducing clinic into the model. This generates a model in which we can examine the effect of level of care on mortality, adjusted for the clinic the patient attended.

> model2_glm(resp ~ clinic + loc, family=binomial(link=logit),data=prenatal.dat)
> summary(model2)

Coefficients:
                Value Std. Error  t value
(Intercept)   -1.7411     0.1785   -9.756
     clinic   -0.9863     0.3089   -3.193
        loc   -0.1503     0.3301   -0.455

(Dispersion Parameter for Binomial family taken to be 1 )

Null Deviance: 16.92 on 3 degrees of freedom
Residual Deviance: 0.11 on 1 degrees of freedom
Number of Fisher Scoring Iterations: 3

Here we see that there is no longer any evidence of a relationship between level of care and mortality: the association that did exist has been explained away by the clinic the patients attended. To see this, note that a test of $H_0: \beta_2 = 0$, while controlling for clinic, gives a p-value of $2\,\Pr(U > 0.455) = 2(1 - \texttt{pnorm}(0.455)) = 0.65$. This is interesting, but it does not mean that there is no effect of level of care for any patient. There may be an interaction between level of care and clinic, for example, and the level of care variable may be significant in one of the clinics. To check this, we therefore fit the model with loc and clinic main effects as well as the loc*clinic interaction.

> model3_glm(resp~loc+clinic+loc*clinic,family=binomial(link=logit),data=prenatal.dat)
> summary(model3)
Call: glm(formula = resp ~ loc + clinic + loc * clinic, family = binomial(link = logit), data = prenatal.dat)

Coefficients:
                Value Std. Error  t value
(Intercept)   -1.7569     0.1857   -9.461
        loc    0.0076     0.5727    0.013
     clinic   -0.9288     0.3514   -2.643
 loc:clinic   -0.2296     0.6949   -0.330

(Dispersion Parameter for Binomial family taken to be 1 )

Null Deviance: 16.92 on 3 degrees of freedom
Residual Deviance: 0 on 0 degrees of freedom
Number of Fisher Scoring Iterations: 3

It is not surprising that this interaction term is not significant, since when we tabulated the data we found that the odds ratios relating level of care with mortality within the two strata were quite similar. Finally, we fit the model involving just the clinic main effect.

> model4_glm(resp ~ clinic, family=binomial(link=logit),data=prenatal.dat)
> summary(model4)
Call: glm(formula = resp ~ clinic, family = binomial(link = logit), data = prenatal.dat)

Coefficients:
                Value Std. Error  t value
(Intercept)   -1.7561     0.1757   -9.997
     clinic   -1.0625     0.2621   -4.053

(Dispersion Parameter for Binomial family taken to be 1 )

Null Deviance: 16.92 on 3 degrees of freedom
Residual Deviance: 0.31 on 2 degrees of freedom
Number of Fisher Scoring Iterations: 3

When testing the hypothesis $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$ we get $2\,\Pr(U > 4.053) = 2(1 - \texttt{pnorm}(4.053)) < 0.0001$. Here we conclude that those patients at Clinic A are at a significantly lower risk of mortality than those at Clinic B, and that this evidence is very strong. We get a point estimate for the odds ratio reflecting the reduction in the odds of mortality in Clinic A compared to Clinic B of $\exp(-1.0625) = 0.35$, with a corresponding 95% CI of $(\exp(-1.5762), \exp(-0.5488)) = (0.21, 0.58)$.
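
In current R the same Wald quantities can be pulled directly from the fitted object. A minimal sketch, assuming model4 has been fit as above (note that R's summary() labels the estimate column Estimate where Splus prints Value):

est <- coef(summary(model4))["clinic", ]
exp(est["Estimate"])                                        # odds ratio, about 0.35
exp(est["Estimate"] + c(-1.96, 1.96) * est["Std. Error"])   # 95% CI, about (0.21, 0.58)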

2.5 Likelihood for Binary Regression

Regard the responses $y_1, y_2, \ldots, y_n$ as observations on $n$ random variables $Y_1, \ldots, Y_n$, where $Y_i \sim \mathrm{Bin}(m_i, \pi_i)$, $i = 1, 2, \ldots, n$. Then

$$\Pr(Y = y; \pi) = \prod_{i=1}^n \binom{m_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i},$$

which gives

$$L(\pi; y) = \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i}$$

and

$$\ell(\pi; y) = \sum_{i=1}^n \left[y_i \log \pi_i + (m_i - y_i)\log(1 - \pi_i)\right] = \sum_{i=1}^n \left[y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right) + m_i \log(1 - \pi_i)\right].$$

Modeling procedures are based on expressing $\pi_1, \pi_2, \ldots, \pi_n$ in terms of fewer parameters (with a view to data reduction). These new parameters take the form of regression coefficients, introduced through a link function. The logistic link

$$g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i'\beta \qquad \text{implies} \qquad \pi_i = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}.$$

If the dimension of $\beta$ is less than $n$, we say we have an unsaturated model. In this case we take the dimension to be $p$, corresponding to a model with $p - 1$ covariates. Under some models with the dimension of $\beta$ equal to $n$, we have a saturated model. Returning to the log-likelihood,

$$\ell(\beta; y) = \sum_{i=1}^n \left[y_i (x_i'\beta) + m_i \log\left(\frac{1}{1 + e^{x_i'\beta}}\right)\right] = \sum_{i=1}^n \left[y_i (x_i'\beta) - m_i \log(1 + e^{x_i'\beta})\right].$$

Upon maximizing $\ell$ with respect to $\beta$, we obtain $\hat{\beta}$ and compute $\hat{\pi}_i = e^{x_i'\hat{\beta}}/(1 + e^{x_i'\hat{\beta}})$. The quality of the fit of these regression models will be judged by how well $\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_n$ fit the data (or equivalently, how well $m_i\hat{\pi}_i$ approximates $y_i$, $i = 1, 2, \ldots, n$). We need a criterion to assess how much worse unsaturated models are than the saturated model. A convenient way of testing nested hypotheses is based on the likelihood ratio statistic.

Likelihood Ratio Tests: Suppose $L(\theta)$ is a likelihood for a $q$-dimensional parameter vector $\theta$. It may be maximized with no constraints on $\theta$, giving $\tilde{\theta}$, or subject to constraints, giving $\hat{\theta}$; in the latter case the effective dimension of $\theta$ will be denoted by $p$. Note that regression models may be interpreted as imposing constraints on the mean responses: we force relationships between the $\mu_i$ values in linear regression, or the $\pi_i$ values in binary regression. We may then formulate the hypothesis that the constraints are reasonable, and test it by seeing how consistent it is with the data. The likelihood ratio statistic

$$-2\log\left(L(\hat{\theta})/L(\tilde{\theta})\right)$$

has a $\chi^2$ distribution on $\nu = q - p$ degrees of freedom if the null hypothesis that the constraints are reasonable is true, where $\nu$ is the difference in the effective number of parameters with and without the constraints. Therefore, if $-2\log(L(\hat{\theta})/L(\tilde{\theta})) > \chi^2_\nu(\alpha)$ we would reject $H_0$ at the $\alpha$ significance level. It is more informative to examine the p-value, and so we compute

$$p = \Pr\left(\chi^2_{q-p} > -2\log(L(\hat{\theta})/L(\tilde{\theta}))\right).$$

Returning to the log-likelihood for binomial data, we have

$$\ell(\pi; y) = \sum_{i=1}^n \left[y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right) + m_i \log(1 - \pi_i)\right].$$

Let $\tilde{\pi} = (\tilde{\pi}_1, \ldots, \tilde{\pi}_n)' = (y_1/m_1, \ldots, y_n/m_n)'$ represent the MLE under the saturated model and let $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_n)'$ denote the MLE under the constrained model imposed by the regression equation. With a little algebra one can show that the LR statistic $-2\log(L(\hat{\pi})/L(\tilde{\pi})) = 2(\ell(\tilde{\pi}) - \ell(\hat{\pi}))$, obtained by substituting the appropriate MLEs into the log-likelihood, has the form

$$2(\ell(\tilde{\pi}) - \ell(\hat{\pi})) = 2\sum_{i=1}^n \left[y_i \log(y_i/m_i) + (m_i - y_i)\log\left(\frac{m_i - y_i}{m_i}\right)\right] - 2\sum_{i=1}^n \left[y_i \log \hat{\pi}_i + (m_i - y_i)\log(1 - \hat{\pi}_i)\right]$$
$$= 2\sum_{i=1}^n \left[y_i \log\left(\frac{y_i}{m_i\hat{\pi}_i}\right) + (m_i - y_i)\log\left(\frac{m_i - y_i}{m_i(1 - \hat{\pi}_i)}\right)\right].$$

This likelihood ratio statistic is central to the analysis of binary regression models and so has a special name: it is called the deviance statistic, represented by $D$ if we think of it as random and by $d$ if we refer to a realized value of it. Splus reports this as the residual deviance, and sometimes it will be called the scaled deviance, for reasons that will become clear shortly. Based on the general result above, we would expect it to have a $\chi^2$ distribution on $n - p$ degrees of freedom. Unfortunately, this distributional approximation for the deviance statistic is not as good as one might hope! It does perform very well, however, for testing nested unsaturated models, which we will consider in the next section.

We remark in passing that the deviance statistic has the form

$$2\sum O_{ij} \log\left(\frac{O_{ij}}{E_{ij}}\right),$$

where $O_{ij}$ is an observed quantity and $E_{ij}$ is an expected quantity. We use two subscripts here since we are summing over both the $y_i$ cells and the $(m_i - y_i)$ cells.

The Pearson statistic is another statistic one can use for assessing the overall fit of a model:

$$P = \sum_{i=1}^n \frac{(y_i - m_i\hat{\pi}_i)^2}{m_i\hat{\pi}_i(1 - \hat{\pi}_i)},$$

which has the form $\sum (O_i - E_i)^2/V_i$. As for the deviance statistic, $P \sim \chi^2_{n-p}$ approximately if the model provides a reasonable fit to the data (i.e. if the assumed model is "true"). The chi-square approximation is a little better than for the deviance statistic. Both, however, are poor if the sample sizes ($m_i$) are small. The deviance and Pearson statistics can be shown to be asymptotically equivalent by a Taylor series expansion.

2.6 Testing Nested Non-saturated Models

Suppose we have a model

$$\log(\pi_i/(1 - \pi_i)) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}$$

and another model

$$\log(\pi_i/(1 - \pi_i)) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{ip} + \cdots + \beta_{q-1} x_{i,q-1}.$$

We may be interested in testing whether the first model, which is a sub-model of the second, provides as good a fit to the data. This is equivalent to testing the significance of the covariates $x_p, \ldots, x_{q-1}$, or to testing $H_0: \beta_p = \cdots = \beta_{q-1} = 0$. Let $\hat{\pi}_i$ denote the MLE of $\pi_i$ under the reduced model with $p$ parameters and let $\tilde{\pi}_i$ denote the MLE of $\pi_i$ under the full model with $q$ parameters. Again with a little algebra one can show that the likelihood ratio statistic corresponding to this test is given by the difference in the deviances of the two models. That is, the appropriate likelihood ratio test statistic is $D = D_0 - D_A$, where $D_0$ is the deviance under the null (reduced) model and $D_A$ is the deviance under the alternative (fuller) model.
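
The deviance, the Pearson statistic and the nested-model comparison can all be computed by hand from a fitted binomial glm. A short R sketch, assuming model1 and model2 from the prenatal care analysis are in the workspace (the y_i log y_i terms would need care if some y_i = 0):

pi.hat <- fitted(model2)                   # fitted probabilities pi_i
y <- prenatal.dat$y; m <- prenatal.dat$m
D <- 2 * sum(y * log(y / (m * pi.hat)) +
             (m - y) * log((m - y) / (m * (1 - pi.hat))))    # matches deviance(model2)
P <- sum((y - m * pi.hat)^2 / (m * pi.hat * (1 - pi.hat)))   # Pearson statistic
# LR test of the reduced model (loc only) against the model that also has clinic:
1 - pchisq(deviance(model1) - deviance(model2), df = 1)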


More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

Introduction to Generalized Linear Models

Introduction to Generalized Linear Models Introduction to Generalized Linear Models Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2018 Outline Introduction (motivation

More information

Generalized Linear Models Introduction

Generalized Linear Models Introduction Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Answer Key for STAT 200B HW No. 7

Answer Key for STAT 200B HW No. 7 Answer Key for STAT 200B HW No. 7 May 5, 2007 Problem 2.2 p. 649 Assuming binomial 2-sample model ˆπ =.75, ˆπ 2 =.6. a ˆτ = ˆπ 2 ˆπ =.5. From Ex. 2.5a on page 644: ˆπ ˆπ + ˆπ 2 ˆπ 2.75.25.6.4 = + =.087;

More information

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T.

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T. Exam 3 Review Suppose that X i = x =(x 1,, x k ) T is observed and that Y i X i = x i independent Binomial(n i,π(x i )) for i =1,, N where ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T x) This is called the

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

STAT 526 Spring Final Exam. Thursday May 5, 2011

STAT 526 Spring Final Exam. Thursday May 5, 2011 STAT 526 Spring 2011 Final Exam Thursday May 5, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek

Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek Binary Response: Logistic Regression STAT 526 Professor Olga Vitek March 29, 2011 4 Model Specification and Interpretation 4-1 Probability Distribution of a Binary Outcome Y In many situations, the response

More information

Answer Key for STAT 200B HW No. 8

Answer Key for STAT 200B HW No. 8 Answer Key for STAT 200B HW No. 8 May 8, 2007 Problem 3.42 p. 708 The values of Ȳ for x 00, 0, 20, 30 are 5/40, 0, 20/50, and, respectively. From Corollary 3.5 it follows that MLE exists i G is identiable

More information

MIT Spring 2016

MIT Spring 2016 Generalized Linear Models MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Generalized Linear Models 1 Generalized Linear Models 2 Generalized Linear Model Data: (y i, x i ), i = 1,..., n where y i : response

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models

Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/25 Right censored

More information

2018 2019 1 9 sei@mistiu-tokyoacjp http://wwwstattu-tokyoacjp/~sei/lec-jhtml 11 552 3 0 1 2 3 4 5 6 7 13 14 33 4 1 4 4 2 1 1 2 2 1 1 12 13 R?boxplot boxplotstats which does the computation?boxplotstats

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model

More information

MSH3 Generalized linear model

MSH3 Generalized linear model Contents MSH3 Generalized linear model 5 Logit Models for Binary Data 173 5.1 The Bernoulli and binomial distributions......... 173 5.1.1 Mean, variance and higher order moments.... 173 5.1.2 Normal limit....................

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Chapter 5: Logistic Regression-I

Chapter 5: Logistic Regression-I : Logistic Regression-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

Generalized Linear Models. stat 557 Heike Hofmann

Generalized Linear Models. stat 557 Heike Hofmann Generalized Linear Models stat 557 Heike Hofmann Outline Intro to GLM Exponential Family Likelihood Equations GLM for Binomial Response Generalized Linear Models Three components: random, systematic, link

More information

STA 450/4000 S: January

STA 450/4000 S: January STA 450/4000 S: January 6 005 Notes Friday tutorial on R programming reminder office hours on - F; -4 R The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Weighted Least Squares I

Weighted Least Squares I Weighted Least Squares I for i = 1, 2,..., n we have, see [1, Bradley], data: Y i x i i.n.i.d f(y i θ i ), where θ i = E(Y i x i ) co-variates: x i = (x i1, x i2,..., x ip ) T let X n p be the matrix of

More information

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News: Today HW 1: due February 4, 11.59 pm. Aspects of Design CD Chapter 2 Continue with Chapter 2 of ELM In the News: STA 2201: Applied Statistics II January 14, 2015 1/35 Recap: data on proportions data: y

More information

MSH3 Generalized linear model Ch. 6 Count data models

MSH3 Generalized linear model Ch. 6 Count data models Contents MSH3 Generalized linear model Ch. 6 Count data models 6 Count data model 208 6.1 Introduction: The Children Ever Born Data....... 208 6.2 The Poisson Distribution................. 210 6.3 Log-Linear

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2) Lectures on Machine Learning (Fall 2017) Hyeong In Choi Seoul National University Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2) Topics to be covered:

More information

Categorical data analysis Chapter 5

Categorical data analysis Chapter 5 Categorical data analysis Chapter 5 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Machine Learning. Lecture 3: Logistic Regression. Feng Li.

Machine Learning. Lecture 3: Logistic Regression. Feng Li. Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

1 Mixed effect models and longitudinal data analysis

1 Mixed effect models and longitudinal data analysis 1 Mixed effect models and longitudinal data analysis Mixed effects models provide a flexible approach to any situation where data have a grouping structure which introduces some kind of correlation between

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

Homework 1 Solutions

Homework 1 Solutions 36-720 Homework 1 Solutions Problem 3.4 (a) X 2 79.43 and G 2 90.33. We should compare each to a χ 2 distribution with (2 1)(3 1) 2 degrees of freedom. For each, the p-value is so small that S-plus reports

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Introduction to the Logistic Regression Model

Introduction to the Logistic Regression Model CHAPTER 1 Introduction to the Logistic Regression Model 1.1 INTRODUCTION Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response

More information

Proportional hazards regression

Proportional hazards regression Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part III Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00 / 5 Administration Homework on web page, due Feb NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00... administration / 5 STA 44/04 Jan 6,

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Lecture 8. Poisson models for counts

Lecture 8. Poisson models for counts Lecture 8. Poisson models for counts Jesper Rydén Department of Mathematics, Uppsala University jesper.ryden@math.uu.se Statistical Risk Analysis Spring 2014 Absolute risks The failure intensity λ(t) describes

More information

BMI 541/699 Lecture 22

BMI 541/699 Lecture 22 BMI 541/699 Lecture 22 Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continous variables 5. Power and sample size for t-based

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3 4 5 6 Full marks

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

Practical Econometrics. for. Finance and Economics. (Econometrics 2)

Practical Econometrics. for. Finance and Economics. (Econometrics 2) Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 15 Outline 1 Fitting GLMs 2 / 15 Fitting GLMS We study how to find the maxlimum likelihood estimator ˆβ of GLM parameters The likelihood equaions are usually

More information

STA102 Class Notes Chapter Logistic Regression

STA102 Class Notes Chapter Logistic Regression STA0 Class Notes Chapter 0 0. Logistic Regression We continue to study the relationship between a response variable and one or more eplanatory variables. For SLR and MLR (Chapters 8 and 9), our response

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017 Binary Regression GH Chapter 5, ISL Chapter 4 January 31, 2017 Seedling Survival Tropical rain forests have up to 300 species of trees per hectare, which leads to difficulties when studying processes which

More information