
Running head: QUASI-NONCONVERGENCE LOGISTIC REGRESSION

Handling Quasi-Nonconvergence in Logistic Regression: Technical Details and an Applied Example

Jeffrey M. Miller, Northcentral University
M. David Miller, University of Florida

Author Note
Jeffrey M. Miller, Ph.D., College of Education, Northcentral University. M. David Miller, Ph.D., Research & Evaluation Methodology, University of Florida. Correspondence regarding this article should be addressed to Jeffrey M. Miller, 4117 SW 20th Ave., #73, Gainesville, FL. Contact: jeffmiller.research@gmail.com

Abstract
Nonconvergence is a concern for any iterative data analysis process. However, there are instances in which convergence is obtained for the overall solution but not for a specific estimate. In most software packages, this problem is not easy to notice unless the researcher has a priori knowledge of reasonable solutions. Hence, faulty inferences can be disguised by an apparently successful estimation procedure; we term this situation quasi-nonconvergence. This type of nonconvergence occurs in logistic regression models when the data are quasi-completely separated, that is, when prediction is completely or nearly completely perfect. Firth (1993) presented a penalized likelihood correction that was extended by Heinze and Ploner (2003) to solve the quasi-nonconvergence problem. This procedure was applied to educational research data to demonstrate its success in eliminating the problem.
Keywords: quasi-nonconvergence, nonconvergence, logistic regression

Handling Quasi-Nonconvergence in Logistic Regression: Technical Details and an Applied Example
Many researchers have experienced nonconvergence errors in which, for one reason or another, a maximum likelihood solution cannot be calculated or does not exist. Conditions for nonconvergence include sparseness of data, multiple maxima, unspecified boundary constraints, and data separation. Logistic regression analyses are especially prone to data separation issues resulting in complete or quasi-complete separation. This article provides calculations for ML estimates in logistic regression, a solution to the data separation problem, and an example using real data.
Logistic regression is often used to describe and/or predict a binary outcome given a set of covariates. The technique is widely used in research on topics including dropping out of school (Suh, Suh, & Houston, 2007), retention and graduation (Wohlgemuth, Whalen, Sullivan, Nading, Mack, & Wang, 2007), and skipping classes (Kimberly, 2007).
In the parlance of generalized linear modeling (McCullagh & Nelder, 1989; Nelder & Wedderburn, 1972), the logistic function has a random, a systematic, and a linking component. The random component specifies information about the response variable. In this case, given n randomly selected, identically distributed, and independent trials, the random component is specified as Y ~ B(n, π) (Agresti, 1996). This is to say that outcome Y has a binomial (B) distribution, with the parameter n representing the number of trials and the parameter π representing the probability that Y equals one. As explained by McCullagh and Nelder (1989), the probability mass function for k successes can be calculated by hand when

provided the total number of trials (n), the number of successes (k), and the probability of a success (p):

f(k; n, p) = \frac{n!}{k!(n-k)!}\, p^{k} (1-p)^{n-k}.   (1)

The systematic component for the logistic regression model contains the P covariates that are specified as predictors of the probability that Y = 1 through their estimators. The link function specifies how the random component should be related to the systematic component. For typical regression analyses (i.e., ordinary least squares regression), we equate the mean of the response variable to the predictor variables and thereby assume the identity link. However, if we do this with a binary response variable, then we can feasibly obtain a regression prediction equation that permits predicted values less than zero or greater than one (Hair, Black, Babin, Anderson, & Tatham, 2006; Agresti, 1996). This is undesirable since we are modeling a probability that is bounded between zero and one. Further, ordinary least squares regression models assume normally distributed residuals, which is rarely the case for binary response variables. The typical (i.e., canonical) link for logistic regression is the logit, which is the natural log of the odds, where the odds equal the probability divided by one minus the probability:

g(\pi) = \log\!\left(\frac{\pi}{1-\pi}\right).   (2)

Combining the random component, systematic component, and link function yields the logistic generalized linear model

\text{logit}(\pi) = \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_P X_P.   (3)
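To make Equations (1) through (3) concrete, the short R sketch below evaluates the binomial probability mass function by hand, checks it against R's built-in dbinom(), and defines the logit link and its inverse. The values of n, k, and p are illustrative assumptions, not quantities from the article.

# Binomial probability mass function of Equation (1), computed by hand
# (n, k, and p are illustrative values only)
n <- 10; k <- 7; p <- 0.6
pmf_by_hand <- factorial(n) / (factorial(k) * factorial(n - k)) * p^k * (1 - p)^(n - k)
all.equal(pmf_by_hand, dbinom(k, size = n, prob = p))  # TRUE

# Logit link of Equation (2) and its inverse, used later for predicted probabilities
logit     <- function(pi)  log(pi / (1 - pi))
inv_logit <- function(eta) 1 / (1 + exp(-eta))
inv_logit(logit(0.25))  # recovers 0.25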

Since natural logs can be reversed through exponentiation, and since odds can be converted to probabilities, the fitted equation can be used to predict probabilities via

\pi_i = \frac{\exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)}{1 + \exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)},   (4)

or, equivalently,

\Pr(y = 1 \mid \mathbf{x}) = \{1 + \exp(-\mathbf{x}'\boldsymbol{\beta})\}^{-1} = \frac{\exp(\beta_1 X_1 + \dots + \beta_P X_P)}{1 + \exp(\beta_1 X_1 + \dots + \beta_P X_P)}.   (5)

Obtaining the required P + 1 estimates of β requires maximum likelihood estimation. An iterative process is used to find the values of the coefficients that are most likely (i.e., that maximize the likelihood) given the data. This is done by iteratively solving the likelihood function

L(\boldsymbol{\beta}; \mathbf{y}) = \prod_{i=1}^{N} \frac{n_i!}{y_i!(n_i - y_i)!}\, \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}.   (6)

First and second derivatives of this function are required in order to determine the maximizing estimates. The calculations are simplified after applying algebraic manipulations to the function. First, we rewrite the term with the subtracted exponent, (n_i − y_i), as a quotient of powers. Second, we return to the generalized linear model for logistic regression, exponentiate both sides of the equation, and solve for π. The result of exponentiation can be substituted into the left-hand side of the likelihood equation, and the solution for π can be substituted into the right-hand side. This simplification contains terms that are powers of powers; thus, the equation can be simplified further, since a number raised to a power that is itself raised to a power is equal to that number raised to the product of the powers. Finally, further simplification by applying the

natural logarithm yields a log-likelihood function that readily permits calculation of first and second derivatives:

l(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left[\, y_i \left(\sum_{p=1}^{P} x_{ip}\beta_p\right) - n_i \log\!\left(1 + \exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)\right)\right].   (7)

Then, differentiating the linear predictor with respect to each coefficient,

\frac{\partial}{\partial \beta_p} \sum_{p=1}^{P} x_{ip}\beta_p = x_{ip},   (8)

which leads to P + 1 equations to be solved for the β_p by setting them to zero:

\frac{\partial l(\boldsymbol{\beta})}{\partial \beta_p} = \sum_{i=1}^{N}\left( y_i x_{ip} - n_i \pi_i x_{ip} \right).   (9)

This first derivative of the log-likelihood with respect to the parameters is also known as the score function, U, or the gradient of the log-likelihood. Next, differentiating once more,

\frac{\partial^2 l(\boldsymbol{\beta})}{\partial \beta_p\, \partial \beta_{p'}} = \frac{\partial}{\partial \beta_{p'}} \sum_{i=1}^{N}\left( y_i x_{ip} - n_i \pi_i x_{ip} \right)   (10)

leads to another set of equations in β:

\frac{\partial^2 l(\boldsymbol{\beta})}{\partial \beta_p\, \partial \beta_{p'}} = -\sum_{i=1}^{N} n_i x_{ip}\, \pi_i (1 - \pi_i)\, x_{ip'}.   (11)

During the iterative procedure, a maximum is achieved if and when the matrix of second derivatives is negative definite. Note that the equation for the second derivative includes the term π_i(1 − π_i), which is also the variance of a binomial variable; hence, it is not a coincidence that the inverse of the negative matrix of second derivatives serves as the covariance matrix for the maximum likelihood estimates.
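Before turning to the practical difficulties, the following R sketch carries out the Newton-Raphson iteration implied by Equations (9) and (11) for a single-predictor model with binary (n_i = 1) outcomes and checks the result against R's glm(). The simulated data, starting values, and iteration limit are purely illustrative assumptions.

set.seed(1)                                   # illustrative simulated data
x <- cbind(1, rnorm(200))                     # design matrix with an intercept column
beta_true <- c(-0.5, 1.2)
y <- rbinom(200, size = 1, prob = 1 / (1 + exp(-x %*% beta_true)))

beta <- c(0, 0)                               # starting values
for (iter in 1:25) {
  eta <- as.vector(x %*% beta)
  pr  <- 1 / (1 + exp(-eta))                  # fitted probabilities, Equation (4)
  U   <- t(x) %*% (y - pr)                    # score vector, Equation (9)
  H   <- -t(x) %*% (x * (pr * (1 - pr)))      # matrix of second derivatives, Equation (11)
  step <- solve(-H, U)                        # Newton-Raphson update
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-8) break            # stop once the updates are negligible
}
cbind(newton = beta, glm = coef(glm(y ~ x[, 2], family = binomial)))  # should agree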

Iteratively solving the equations is a tedious process. Ferron and Hess (2007) provide an excellent didactic example of both maximum likelihood estimation and the iterative procedure for structural equation models. Succinctly, starting values for the estimates are declared, and succeeding iterations are based on the first two terms of a Taylor expansion, leading to subsequent estimates that are, ideally, closer and closer to the maximizing solution.
A problem can arise when maximizing estimates for data that suffer from complete or quasi-complete separation (Albert & Anderson, 1984). This is to say that the probability of y = 1 or y = 0 is nearly perfectly predictable from a predictor or set of predictors (Webb, Wilson, & Chong, 2004). An extreme example of the separation issue occurs when a 2 × 2 analysis has one cell containing all 0s or all 1s (Heinze & Ploner, 2003). The problem has been addressed in detail in the fields of medicine (Heinze & Ploner, 2003) and biometrics (Firth, 1993). However, no research was found that addressed the issue using educational data.
Convergence for logistic regression models is affected by the configuration of the observed values in the sample space (So, 1999; Santner & Duffy, 1986; Albert & Anderson, 1984). This is simple to conceptualize by imagining a scatterplot relating y = the probability of being in elementary school to x = subject age (ranging from 5 to 35). There would obviously be two distinct clusters of observations; this is the configuration for this space. Suppose that the resulting maximum likelihood estimate for the age coefficient perfectly predicts being enrolled or not being enrolled in elementary school. In fact, given perfect predictability, the maximized log-likelihood is zero because the fit is perfect. This is complete separation, and it would be apparent in the data after sorting

by age. In a scatterplot, a line could be drawn between the two clusters; no member of one cluster would appear either on the line or within the other cluster.
A more realistic scenario is one that involves borderline observations. Due to month-of-birth differences, there is presumably an age at which only some observations will be classified as enrolled in elementary school. The scatterplot would still neatly divide the clusters; however, some observations would fall on the line itself. In this case, a maximum likelihood solution is reported and the maximized log-likelihood is usually very close to zero. However, the dispersion matrix will usually take on unrealistically large or small values. In other words, without prior knowledge regarding reasonable estimates of the coefficients and variances, the apparently successful convergence may create the illusion of accurate results, leading to incorrect inferences and interpretations; we term this situation quasi-nonconvergence.
If the researcher has an idea of what the coefficients and/or variances should be, then the problem can be identified, since the reported estimates may be absurd. Variances tend to be extremely inflated (Webb, Wilson, & Chong, 2004), and odds ratios are often infinite (Heinze, 2006); for example, SAS may report an odds ratio estimate only as an extreme "greater than" bound. Program logs and/or output sometimes provide a clue, although there is much variability in how the clue may be presented (Zorn, 2005). SAS reports, "Quasi-complete separation of data points detected. Warning: The maximum likelihood estimate may not exist. Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable." SPSS reports, "Estimation terminated at iteration because maximum iterations have been reached. Final solution cannot be found."
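These symptoms are easiest to recognize after provoking them deliberately. The minimal R sketch below uses an assumed toy data set mirroring the school-age example above, in which age 10 occurs with both outcomes (quasi-complete separation); fitting it by ordinary maximum likelihood with glm() typically produces an extreme coefficient, an enormous standard error, and the R warning discussed next.

# Illustrative data assumed for demonstration: everyone aged 10 or younger is
# enrolled except one borderline child, so age 10 occurs with both outcomes.
sep_data <- data.frame(
  age  = c(5, 6, 7, 8, 9, 10, 10, 11, 12, 15, 20, 25, 30, 35),
  elem = c(1, 1, 1, 1, 1,  1,  0,  0,  0,  0,  0,  0,  0,  0)
)

fit_ml <- glm(elem ~ age, family = binomial, data = sep_data)
# Typically warns: "fitted probabilities numerically 0 or 1 occurred"
summary(fit_ml)  # extreme age coefficient with an enormous standard error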

The R program reports, "fitted probabilities numerically 0 or 1 occurred in:", which is not to say that the observed responses for a variable were all 0s or 1s. Stata actually eliminates variables under complete separation in order to produce a solution, and it fails to provide any estimates under quasi-complete separation.
Researchers encountering complete or quasi-complete separation tend to take arbitrary steps to eliminate the problem. The simplest solution is to delete the predictor that is responsible for the separation. This assumes that the problem is due to a particular predictor and not a linear combination (Heinze, 2006). Some researchers insert artificial data to eliminate separation. Another possibility is to use exact logistic regression as proposed by Cox (1970); however, this procedure leads to degenerate estimates when some or all of the predictors are continuous (Heinze & Schemper, 2002). Zorn (2005) presents many examples of published research that attempted to resolve separation issues using such procedures; the procedures have been reiterated by others (Heinze, 2006; Heinze & Schemper, 2002).
A more appropriate procedure, penalized maximum likelihood, is more recent. The procedure is based on research by Firth (1993) that was not originally concerned with separation, and it was applied to resolving separation issues in logistic regression by Heinze and Ploner (2003). It has long been known that maximum likelihood logistic regression estimates are biased; Firth (1993) refers to findings of 3.4% asymptotic bias away from the true value (Copas, 1988). Hence, Firth (1993) proceeded to construct a penalized likelihood correction. This correction is a shrinkage quantity intended to remove the bias. The correction, known in the Bayesian literature as the Jeffreys invariant prior (Zorn, 2005), adds one-half of the natural logarithm of the determinant of the information matrix to the log-likelihood

while concurrently adjusting the score function. As the log-likelihood approaches zero over the iterations, the adjustment serves to counteract the bias. The resulting penalized likelihood, log-likelihood, and score equations are displayed below:

L^{*}(\boldsymbol{\beta}) = L(\boldsymbol{\beta})\,\lvert I(\boldsymbol{\beta})\rvert^{1/2},   (12)

\ln L^{*}(\boldsymbol{\beta}) = \ln L(\boldsymbol{\beta}) + 0.5 \ln \lvert I(\boldsymbol{\beta})\rvert,   (13)

U^{*}(\beta_p) = U(\beta_p) + 0.5\, \mathrm{tr}\!\left[ I(\boldsymbol{\beta})^{-1}\, \frac{\partial I(\boldsymbol{\beta})}{\partial \beta_p} \right] = 0.   (14)

Heinze and Ploner (2003) found that this correction also eliminates what they term the nonconvergence bug due to separation issues. They noted that the correction will always provide a finite solution in the presence of separation. Heinze and Schemper (2002) found 97.5% coverage for the confidence intervals as well as high power. Zorn (2005) noted that "because they are shrunken towards zero, penalized-likelihood estimates will typically be smaller in absolute value than standard MLEs, though their standard errors will also be reduced, yielding similar inferences about the significance of parameter estimates for those parameters whose MLE is finite" (p. 160).
This is not to say that penalized maximum likelihood is a panacea for separation issues. Caution is advised when constructing confidence intervals for the odds ratio. The penalized likelihood can be asymmetric around the estimate, leading to inappropriate coverage for traditional Wald intervals. In this case, the analyst should construct profile likelihood intervals for the penalized estimates (Heinze, 2006; Heinze & Ploner, 2003; Zorn, 2005). Another caution is not to ignore other modeling problems; for example, penalized likelihood does not resolve issues of multicollinearity (Heinze & Schemper, 2002).
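As one illustration of the penalized approach, the sketch below refits the quasi-separated toy data from the earlier sketch using the logistf package, Heinze and Ploner's R implementation of the Firth correction. The comment about default profile penalized likelihood limits is an assumption about the package defaults and should be checked against its documentation.

# install.packages("logistf")  # Heinze & Ploner's penalized-likelihood package
library(logistf)

sep_data <- data.frame(  # same illustrative quasi-separated data as before
  age  = c(5, 6, 7, 8, 9, 10, 10, 11, 12, 15, 20, 25, 30, 35),
  elem = c(1, 1, 1, 1, 1,  1,  0,  0,  0,  0,  0,  0,  0,  0)
)

fit_firth <- logistf(elem ~ age, data = sep_data)
summary(fit_firth)    # finite coefficient and standard error; the reported
                      # limits are (by default) profile penalized-likelihood limits
exp(coef(fit_firth))  # odds ratio on a finite, interpretable scale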

The primary difference between Firth's correction and those of others (Rubin & Schenker, 1987; Clogg, Rubin, Schenker, Schultz, & Weidman, 1991) is that the other procedures are not iterative and hence do not apply the correction over the course of the maximization. Heinze and Ploner (2004) wrote SAS, S-PLUS, and R macros to implement penalized maximum likelihood estimation; the code is freely available online. The code also produces figures to help determine whether Wald confidence intervals or profile-based intervals are more appropriate.

Applied Example

The data for this example were extracted from the Internet Access in U.S. Public Schools survey (NCES, 2005), collected in the Fall of 2005. The survey was developed to measure the extent of Internet usage and aspects related to Internet usage, such as the type of connection, control of content access, and integration into the curriculum. Responses were obtained from 1,104 school technology coordinators (or the staff members most knowledgeable about school Internet access). For the purposes of this example, listwise deletion was used to remove all observations with missing data on one or more of the variables used as predictors or responses, resulting in an analyzed sample size of 822. Further listwise deletions were made in order to produce quasi-complete separation, reducing the sample size to 814.
The binary response variable was item Q7DA: Students with Disabilities (0 = No Access to Internet, 1 = Access to Internet). There were three binary predictors: NSLEVEL: school level (0 = Secondary, 1 = Elementary); Q9AB: What technologies or other procedures does your school use to prevent student access to inappropriate material

on the Internet? Intranet (0 = No, 1 = Yes); and Q10: Does your school allow students to access instructional computers with Internet access at times other than regular school hours? (0 = No, 1 = Yes). In addition, there were three continuous predictors: Q4A: the number of computers in the school that currently have Internet access; PCTMIN: percent minority enrollment; and FLEP: percent of students eligible for free or reduced-price school lunch.
The adjusted variable was NSLEVEL, such that the probability of a student having Internet access was perfectly predictable for elementary schools but not for secondary schools. In other words, for elementary schools, all students with disabilities had Internet access; for secondary schools, not all students with disabilities had Internet access. If prediction had also been perfect for secondary schools, the condition would have been complete, rather than quasi-complete, separation.
These data were analyzed with logistic regression using SAS PROC LOGISTIC. Table 1 displays the partial SAS output containing the warning of quasi-complete separation. However, a maximum likelihood solution does appear to exist. Further, the maximum likelihood estimates appear to be reasonable until one inspects the standard errors and notices the value of 100.8 for the school-level (NSLEVEL) variable, which is also the standard error for the intercept. These estimates are displayed in Table 2. The problem is transparent when inspecting the maximum likelihood estimates for the odds ratios. The school-level variable has a reported odds ratio estimate of <0.001, with a 95% Wald confidence interval running from <0.001 to an extremely large upper limit. Given the warning and the unreasonable odds ratio estimate, we conclude quasi-nonconvergence for this solution. These results are displayed in Table 3.
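For readers working in R rather than SAS, the hedged sketch below shows how the same model could be specified; nces is a hypothetical data frame assumed to hold the extracted NCES variables under the names used above, and the logistf() refit parallels the penalized SAS reanalysis reported next. The published results were produced with SAS PROC LOGISTIC and the Heinze-Ploner SAS macro, so this sketch is illustrative only.

# Hedged sketch: 'nces' is an assumed data frame containing the extracted NCES
# variables under the names used in the text.
library(logistf)

fit_ml <- glm(Q7DA ~ NSLEVEL + Q9AB + Q10 + Q4A + PCTMIN + FLEP,
              family = binomial, data = nces)
summary(fit_ml)     # expect the quasi-separation symptoms summarized in Tables 2-3

fit_firth <- logistf(Q7DA ~ NSLEVEL + Q9AB + Q10 + Q4A + PCTMIN + FLEP,
                     data = nces)
summary(fit_firth)  # finite, interpretable estimates, paralleling Tables 4-5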

The data were analyzed again using the penalized log-likelihood macro in SAS. This time there were no nonconvergence warnings. As seen in Table 4, the standard errors for the intercept and the school-level (NSLEVEL) variable are more reasonable. Table 5 suggests that, for school level, the odds of students with disabilities having Internet access are substantially higher for elementary schools than for secondary schools, a considerably more reasonable estimate than the previously reported odds ratio of <.001. It is also interesting to note that maximum likelihood produced a p-value of 0.929 for school level, which strongly suggests a lack of support for this variable as a predictor, whereas the penalized maximum likelihood results indicate an extremely low p-value for this variable. This tells us that quasi-nonconvergence should not be ignored.

Conclusion

The primary purpose of this article is to inform applied researchers about a technique for handling a common problem that occurs when analyzing binary response variables. Logistic regression was reviewed from the perspective of generalized linear models, followed by an extensive treatment of maximum likelihood estimation. Not all logistic regression models truly converge. Quasi-nonconvergence can arise due to complete or quasi-complete separation of the data in the configuration space. Much previous research has addressed the problem by eliminating the culprit variables, transforming the data, or even inserting artificial data. We also suspect that some research has not addressed the problem at all, since statistical software does not provide a clear indication of the problem or of how to handle it.

Firth (1993) proposed penalized likelihood as a mechanism for correcting the bias inherent in logistic regression. Heinze and Ploner (2003) extended penalized likelihood estimation to resolve nonconvergence in logistic regression due to separation issues. Their research demonstrated adequate coverage and power for odds ratio estimates that always have a finite solution.
It is hoped that researchers will become more aware of the potential for quasi-nonconvergence in logistic regression, which may otherwise go unnoticed. A converging maximum likelihood estimation does not guarantee a converging parameter estimate, and the consequences of reporting such results may be incorrect or even nonsensical inferences. Researchers are encouraged to obtain the macro for penalized maximum likelihood and to apply it when faced with complete or quasi-complete data separation.

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: John Wiley and Sons.

Albert, A., & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71.

Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., & Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86.

Copas, J. B. (1988). Binary regression models for contaminated data (with discussion). Journal of the Royal Statistical Society: Series B, 50.

Cox, D. (1970). Analysis of binary data. New York: Wiley.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80.

Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Upper Saddle River, NJ: Pearson Education.

Heinze, G. (2006). A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine, 25.

Heinze, G., & Ploner, M. (2004). A SAS macro, S-PLUS library, and R package to perform logistic regression without convergence problems (Technical Report 2/2004). Vienna: Medical University of Vienna, Department of Computer Sciences, Section of Clinical Biometrics.

Heinze, G., & Ploner, M. (2003). Fixing the nonconvergence bug in logistic regression with SPLUS and SAS. Computer Methods and Programs in Biomedicine, 71.

Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21.

Kimberly, L. (2007). Who's skipping school: Characteristics of truants in 8th and 10th grade. Journal of School Health, 77.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135.

Rubin, D. B., & Schenker, N. (1987). Logit-based interval estimation for binomial data using the Jeffreys prior. In C. C. Clogg (Ed.), Sociological Methodology 1987. Washington, DC: American Sociological Association.

Santner, T. J., & Duffy, E. D. (1986). A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika, 73.

So, Y. (1999). A tutorial on logistic regression (SAS Technical Report). Cary, NC.

Suh, S., Suh, J., & Houston, I. (2007). Predictors of categorical at-risk high school dropouts. Journal of Counseling & Development, 85.

Webb, M. C., Wilson, J. R., & Chong, J. (2004). An analysis of quasi-complete binary data with logistic models: Applications to alcohol abuse data. Journal of Data Science, 2.

Wohlgemuth, D., Whalen, D., Sullivan, J., Nading, C., Mack, S., & Wang, Y. (2007). Financial, academic, and environmental influences on the retention and graduation of students. Journal of College Student Retention: Research, Theory & Practice, 8.

Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13.

Table 1
SAS Output Displaying Warning of Quasi-Complete Separation

The LOGISTIC Procedure
Model Convergence Status
Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration.
Warning: The validity of the model fit is questionable.

Table 2
SAS Output Displaying ML Logit Estimates

Parameter    DF    Estimate    Standard Error    Chi-Square    Pr > ChiSq
Intercept
NSLEVEL
Q9AB
Q10
Q4A
PCTMIN
FLEP

Table 3
SAS Output Displaying ML Odds Ratio Estimates

Effect            Point Estimate    95% Wald Confidence Limits
NSLEVEL 0 vs. 1   <0.001            <0.001
Q9AB 0 vs. 1
Q10 0 vs. 1
Q4A
PCTMIN
FLEP

Table 4
SAS Output Displaying Penalized ML Logit Estimates

FL estimates and Wald confidence limits and tests
NOTE: Confidence interval for Intercept based on Wald method.

Variable    Parameter Estimate    Standard Error    Lower 95% c.l.    Upper 95% c.l.    Pr > Chi-Square
INTERCEP                                                                                <.0001
NSLEVEL
Q9AB
Q4A
PCTMIN
FLEP

Table 5
SAS Output Displaying Penalized ML Odds Ratio Estimates

Variable    Odds Estimate    Lower 95% c.l.    Upper 95% c.l.    Pr > Chi-Square
NSLEVEL
Q9AB
Q4A
PCTMIN
FLEP
