Ron Heck, Fall 2011
EDEP 768E: Seminar in Multilevel Modeling, rev. January 3, 2012 (see footnote)
Week 8: Introducing Generalized Linear Models: Logistic Regression
(Replaces prior revision dated October 20, 2011)

The generalized linear model (GLM) represents an extension of the linear model for investigating outcomes that are categorical (e.g., dichotomous, ordinal, multinomial, counts). The model has three components that are important to consider:

- A random component that specifies the variance in terms of the mean (μij), or expected value, of Y for individual i in group j;
- A link function, often represented as g(), which converts the expected values (μij) of Y to transformed predicted values ηij [ηij = g(μij)]. The Greek letter η (eta) is typically used to denote the transformed linear predictor. The link function therefore provides the relationship between the linear predictor and the mean of the distribution function; and
- A structural component that relates the transformed predictor of Y to a set of predictors.

One type of GLM that can be applied in various situations is the logistic regression model. It can be applied to dichotomous dependent variables (e.g., pass/fail, stay/leave) and can also be extended to a dependent variable with several categories (e.g., manager, clerical, custodian). The categories may be ordered (as in ordinal regression) or unordered. In the latter case (called multinomial logistic regression), each category is compared to a reference category (the last category is the default in SPSS). Logistic regression can also readily accommodate interactions among predictors (e.g., gender x ethnicity). It is important to note that extremely high correlations between predictors may lead to multicollinearity. One fix is to eliminate one of the variables in a multicollinear pair from the model.
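To make the link-function idea concrete, here is a minimal Python sketch (mine, not part of the original handout) of the logit link used throughout these notes, showing that the link g and its inverse map back and forth between a probability and the linear predictor η:

```python
import math

def logit(p):
    """Link function g: maps a probability in (0, 1) to the linear predictor eta."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: maps the linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# The link and its inverse undo each other.
p = 0.75
eta = logit(p)                    # log(0.75/0.25) = log(3)
print(round(inv_logit(eta), 10))  # 0.75
```

A probability of 0.5 maps to η = 0, and probabilities near 0 or 1 map to large negative or positive values of η, which is what lets the structural component be an unbounded linear function of the predictors.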
The Model

Ordinary least squares regression cannot be correctly applied to data where the outcome is categorical (e.g., two categories coded 0 and 1) or ordinal (i.e., with a restricted range). In the case of a dichotomous dependent variable (Y), for example, linear models with nonzero slopes will eventually predict values of Y greater than 1 or less than 0. The problem is that when an outcome can only be 0 or 1, it will not be well described by a line, since there is a limit on the values that describe someone's outcome (0 = dropout, 1 = persist). Predictions will break down at the boundaries (0 or 1). Suppose we use students' socioeconomic status (SES) to predict their likelihood of persisting versus dropping out. Any linear model with a nonzero slope for the effect of SES on the probability of persisting will eventually predict values of Y greater than 1 or less than 0 (e.g., 1.5 or -0.5), yet these are impossible predictions, since by definition the values of Y can only be 0 or 1. Moreover, for dichotomous outcomes, the distribution of the errors from predicting whether someone persists or not will be nonrandom (since the errors can take only two values) and, therefore, cannot be normally distributed (violating a basic assumption of the linear model). Contrast this with a linear relationship, for example, predicting someone's math test score from family income, where one family's income might be $25,000 while another's is $250,000. The regression line will extend along the Y axis and X axis well beyond the limits of 0 and 1, and the prediction errors will also tend to be random and normally distributed. Because of these problems, we need a nonlinear model that can transform predicted values of Y to lie within the boundaries of 0 and 1 (Hamilton, 1992).

Footnote 1: This is a post-class revision of the original document released on October 5, 2011 (followed by a revised version dated October 20, 2011). The post-class revision was released/uploaded on January 3, 2012.

Logistic regression emphasizes the probability of a particular outcome occurring, given an individual's pattern of responses to a set of independent variables. This relationship between the independent variables and the outcome categories is nonlinear (i.e., not constant across levels of the independent variable) with respect to the probabilities of the outcome being 0 or 1. The logit transformation typically produces a type of S-curve. We might standardize income (X), as in the following graph.

Figure 1. S-curve graph

Logit Coefficient

While the logistic regression model is nonlinear for probabilities (because they are bounded between 0 and 1), it is linear with respect to the logit coefficients. The probability of an
event occurring can then be written as π = P(Y = 1). By taking the natural logarithm of the odds of the event occurring versus not occurring, we obtain a logit (η). The logit can then be predicted as a linear function of a set of predictors:

η = log[π/(1 - π)] = β0 + β1X1 + β2X2 + ... + βpXp

Because we have taken the natural log of the odds π/(1 - π), we can interpret the effect of a predictor as the change in the log odds (η) associated with a one-unit change in that independent variable. So if we had a simple model predicting the log odds of persisting using only female, the equation might look like this:

η = 0.06 + 0.76(female)

The coefficient for female is 0.76, which suggests that as gender changes from 0 to 1 (i.e., male coded 0 to female coded 1), the log odds of persisting increase by 0.76 units. For a man, the log odds of persisting would then be 0.06 [0.06 + 0.76(0)]. Notice also that there is no error term in a logistic model. This is because in the binomial probability distribution the variance is related to the mean proportion, so it cannot be estimated separately. We can use the inverse of the logit link function to obtain the probability of persisting:

π = 1/(1 + e^-η)

In this case, the probability of persisting for a man would be 1/[1 + e^-(0.06)] = 0.51, where e is the base of the natural logarithm, approximately 2.71828. Since the estimate of a male's probability of persisting is just about 0.50 (or half the time), knowing that an individual is male is not much better than flipping a coin for predicting whether that particular individual will persist. If the estimate were considerably below 0.5, we would predict that the individual would be unlikely to persist. For a female, however, the log odds of persisting are much better. The estimated log odds would be 0.82 [0.06 + 0.76(1)]. The probability of persisting would then be 1/[1 + e^-(0.82)] = 0.69. So, since the estimate is considerably above 0.5, we would predict that a given female is likely to persist.

Odds Ratio

People generally interpret the odds ratio (i.e., the increase or decrease in the odds of being in one outcome category when the predictor increases by one unit). Consider the case where we flip a coin. The odds of it coming up heads versus tails are the same:

0.50/0.50 = 1

Therefore, when the events are equally likely to occur, we say the ratio of their odds is equal to 1. The odds can be expressed as follows:

π/(1 - π) = e^(β0 + β1X1 + β2X2 + ... + βqXq)

This ought to look somewhat similar to the log odds equation. The odds ratio for a particular predictor variable is defined as e^β, where β is the logit coefficient estimate for the predictor and e is the base of the natural logarithm. If β is zero, the odds ratio will equal 1 (since any number to the 0 power is 1), which leaves the odds unchanged. If β is positive, the odds ratio will be greater than 1, which means the odds are increased. If β is negative, the odds ratio will be less than 1, which means the odds are decreased. Consider the previous case, where the relationship of female (coded male = 0, female = 1) to persisting to graduation is examined. If the odds ratio [also abbreviated as Exp(β)] is 2.5, it means that as gender changes from male to female (i.e., a unit change in the independent variable), the odds of persisting versus dropping out are multiplied by a factor of 2.5. In contrast, consider the case where the odds ratio is 0.40. If we reversed the coding of gender (females = 0, males = 1), this would imply that the odds of persisting diminish for males by a factor of 0.40 compared with females. These two odds ratios can be shown to be equivalent by division (1/0.40 = 2.5 and 1/2.5 = 0.40).
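The persistence example above can be verified with a few lines of Python (a check I have added, using the handout's estimates of 0.06 and 0.76):

```python
import math

def inv_logit(eta):
    # Inverse logit link: probability of persisting given the log odds eta
    return 1 / (1 + math.exp(-eta))

b0, b1 = 0.06, 0.76        # intercept and coefficient for female
eta_male = b0 + b1 * 0     # log odds for a man (female = 0)
eta_female = b0 + b1 * 1   # log odds for a woman (female = 1)

print(round(inv_logit(eta_male), 2))    # 0.51 (about a coin flip)
print(round(inv_logit(eta_female), 2))  # 0.69 (likely to persist)

# An odds ratio and its reciprocal describe the same relationship
# with the coding reversed: 1/0.40 = 2.5 and 1/2.5 = 0.40.
print(1 / 0.40, 1 / 2.5)
```

This reproduces the 0.51 and 0.69 probabilities worked out by hand above.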
Model Fitting Logistic models are commonly estimated with maximum likelihood rather than ordinary least squares (as in multiple regression). Estimates are sought that yield the highest possible values for the likelihood function. The equations are nonlinear in that the parameters cannot be solved directly (as in OLS regression). Instead, an iterative procedure is used, where successively better
approximations are found that maximize the log likelihood function (Hamilton, 1992).

Nested Models

Successive logistic regression models can be evaluated against each other. Successive models (called nested models when all of the elements of the smaller model are also in the bigger model) can be evaluated in logistic regression by comparing them against a baseline model using the change in their log likelihoods. We compare the deviance (or -2*log likelihood) of each model. The difference between models is distributed as chi-square, with degrees of freedom equal to the difference in the number of parameters estimated between the two models. For example, if one is evaluating a model with three predictors and chooses to add another predictor, the change in model fit can be assessed. For one degree of freedom, a significant improvement (at p = .05) in model fit would require a chi-square change of at least 3.84. The likelihood ratio test (G²) is defined as the following:

G² = 2[(log likelihood for bigger model) - (log likelihood for smaller model)]

So, for example, if the log likelihood of the smaller model (3 predictors) is -10.095 and that of the bigger model (4 predictors) is -8.740, the equation looks like this:

G² = 2[(-8.740) - (-10.095)] = 2.71

Because the required chi-square value for 1 df is 3.84, we would conclude that the model with three predictors is preferred over the other model, since adding the fourth predictor did not improve the model's fit significantly.

Common Problems

A number of problems can occur if there are too few cases relative to the number of predictors in the model. It is likely that the model will produce some very large parameter estimates and large standard errors. This can result from combinations of variables that produce too many cells with no cases in them. The model may also fail to converge on a solution.
Sometimes this can be fixed by collapsing categories (i.e., making fewer categories) or by eliminating independent variables that are not important.

A Two-Level GLM Model

The single-level model can easily be extended to a generalized linear mixed model (GLMM). We start here by introducing the basic specification for a two-level model. For a two-level model, we include subscript i to represent an individual nested within a level-2 unit designated by subscript j. The level-1 model for individual i nested in group j is of the general form:
ηij = x′β,

where x is a (p + 1) × 1 vector of predictors for the linear predictor of Y and β is a vector of corresponding regression coefficients. Notice there is no error term at level 1. An appropriate link function is then used to link the expected value of Y to ηij. In this case, we will use the logit link function:

ηij = log[πij/(1 - πij)] = β0j + β1jX1ij + β2jX2ij + ... + βqjXqij

At level 2, the level-1 coefficients βqj can become outcome variables. Following Raudenbush et al. (2004), a generic structural model can be denoted as follows:

βqj = γq0 + γq1W1j + γq2W2j + ... + γqSqWSqj + uqj,

where γqs (s = 0, 1, ..., Sq) are the level-2 coefficients, Wsj are level-2 predictors, and uqj are level-2 random effects. You can see that the level-2 model (and models at successive levels) are specified the same as for a model with continuous outcomes. It is only the level-1 model that is different.

We use the GENLIN MIXED program (available starting in SPSS Version 19) to build an example two-level model to examine students' proficiency in math. In samples where individuals are clustered in higher-order social groupings (e.g., a department, a school, or some other type of organization), simple random sampling does not hold, because individuals clustered in groups will tend to be similar in various ways. For example, they attend schools with particular student compositions, expectations for student academic achievement, and curricular organization and instructional strategies. If the clustering of students is ignored, it is likely that bias will be introduced in estimating model parameters. As has been noted previously, where there are clustered data, it is likely that there is a distribution of both intercepts and slopes around their respective average fixed effects. In this situation, we might wish to investigate the random variability in intercepts and slopes across the sample of higher-level units in the study.
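To illustrate what the two-level specification implies, here is a small Python sketch (mine, with made-up values for γ00 and the school random effects u0j, chosen only for illustration) showing how a school's random intercept shifts its students' predicted probability:

```python
import math

def inv_logit(eta):
    # Inverse logit link: converts the linear predictor to a probability
    return 1 / (1 + math.exp(-eta))

gamma00 = 0.68  # hypothetical overall intercept on the log odds scale

# Hypothetical level-2 random effects u0j for three schools
u = {"school_A": 0.50, "school_B": 0.0, "school_C": -0.50}

# Combined unconditional model: eta_ij = gamma00 + u0j
for school, u0j in u.items():
    eta = gamma00 + u0j
    print(school, round(inv_logit(eta), 3))
```

Schools with positive u0j have predicted proficiency probabilities above the average school, and schools with negative u0j fall below it, which is exactly the between-school variability the random intercept is meant to capture.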
Once we determine that variation exists in the parameters of interest across groups, we can build level-2 models to explain this variation. In some cases, we may have a specific theoretical model in mind that we wish to test, while in other cases, we might be trying to explore possible new mechanisms that explain this observed variation in parameters across groups. We have a data set with 7,009 high school students in 988 schools. We wish to determine what might affect their likelihood of being proficient in math. Within schools, we have background
variables associated with student SES, grade point average, and whether they were in a primarily college prep or more advanced high school curricular program. Between schools, we have a student composition variable and a variable describing the academic focus of the school.

No Predictors Model

We can begin with a no-predictors model. At level 1, the unconditional model relating the transformed predicted values to an intercept parameter is defined as follows:

ηij = log[πij/(1 - πij)] = β0j

We note again that there is no separate level-1 residual variance term for models with categorical outcomes. The level-2 model will simply be the following:

β0j = γ00 + u0j

Through substitution, the combined single equation is the following:

ηij = γ00 + u0j,

which suggests there are two parameters to be estimated (the intercept and the random level-2 effect). Here we can see the estimated log odds of being proficient is 0.684.

Table 1: Fixed Effects
Model Term   Coefficient   Std. Error   t        Sig.   Exp(Coefficient)   95% CI Lower   95% CI Upper
Intercept    0.684         0.034        20.328   .000   1.982              1.856          2.118
Probability distribution: Binomial; Link function: Logit

Notice the look of the tables is a bit different in the GENLIN MIXED program (i.e., it requires using the computer mouse to open up each part of the output and then converting heat maps to tables). I have used a table template I created to present the relevant output.

Table 2: mathprof2

We can see in the above table that most students are likely to be proficient in math. The percentage of proficient students at the individual level is about 66.4%.
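The intercept in Table 1 can be converted to the odds ratio and probability scales by hand; a quick Python check (using the handout's estimate of 0.684):

```python
import math

gamma00 = 0.684  # estimated intercept: log odds of being proficient

odds = math.exp(gamma00)             # the Exp(Coefficient) column
prob = 1 / (1 + math.exp(-gamma00))  # inverse logit

print(round(odds, 2))   # 1.98, matching Exp(Coefficient) in Table 1
print(round(prob, 3))   # 0.665, close to the observed 66.4% proficient
```

The small gap between 0.665 and 66.4% reflects the fact that the model-based figure is the average school's estimated proficiency rather than the raw student-level percentage.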
We can calculate the probability of being proficient from the fixed effects table above as the following:

π = 1/(1 + e^-η) = 1/(1 + e^-(0.684))

The resulting proficiency level averaged across schools is about 0.66 [1/(1 + 0.505) = 0.66]. This is a little different from the sample percentage, since it is the average unit (school) estimated proficiency level rather than the average for the population of students. If we look at the odds ratio (1.98), we can say students are about 2:1 more likely to be proficient than not proficient. If we made this a proportion, it would be something like 0.667/0.333 = 2.

Variance Components

The level-2 variability suggests that the probability of being proficient varies significantly across schools (Z = 8.144, p < .01).

Table 3: Variance Components
Random and Residual Effects   Estimate   Std. Error   Z-test   Sig.   95% CI Lower   95% CI Upper
Var(Intercept) a              0.386      0.047        8.144    .000   0.303          0.490
Residual b                    1.00
a Covariance structure: Variance components; Subject specification: schcode
b Covariance structure: Scaled identity; Subject specification: none

We can notice also that the variability at level 1 (Residual) is scaled to 1.0. This is because the variance at level 1 is tied to the population proportion of individuals who are proficient, so it cannot be estimated separately from the mean. Instead, it is simply scaled to 1.0 to provide a metric for the log odds scale. Despite the scaling to 1.0, an intraclass correlation can be estimated describing the proportion of variance that lies between units (σ²Between) relative to the total variance (i.e., σ²Between + σ²Within). The variance of a logistic distribution with scale factor 1.0 is π²/3 ≈ 3.29 (Hox, 2002), so we
can calculate an ICC (ρ) as ρ = σ²Between/(σ²Between + 3.29). In this case, it will be 0.386/(0.386 + 3.29) = 0.386/3.676 = 0.105. This suggests about 10.5% of the variance in math proficiency lies between schools.

Individual Predictors

We might decide to go ahead and build a multilevel model. We can interpret the intercept as the log odds of being proficient when all the other variables are 0. In this case, SES and GPA are standardized (mean = 0, SD = 1). The intercept is therefore the log odds of being proficient for a student at the mean of SES and GPA who is not in the more advanced curricular program (coded 0). Such a student is about 2.02 times as likely to be proficient as not proficient (odds ratio = 2.02, p < .01).

Table 4: Fixed Effects
Model Term   Coefficient   Std. Error   t        Sig.   Exp(Coefficient)   95% CI Lower   95% CI Upper
Intercept    0.704         0.036        19.776   .000   2.021              1.885          2.167
ses          0.527         0.040        13.300   .000   1.694              1.568          1.831
gpa          0.432         0.031        14.111   .000   1.541              1.451          1.636
acprog=1     0.137         0.074        1.859    .063   1.147              0.993          1.325
acprog=0     0 a
Probability distribution: Binomial; Link function: Logit
a This coefficient is set to zero because it is redundant.

Increasing student SES by 1 SD (since SES is standardized with mean = 0 and standard deviation = 1) would result in an increase in the predicted log odds of being proficient of 0.527, other variables held constant. Alternatively, we can say that the odds of such an individual being proficient increase by a factor of 1.694 compared with an individual at the mean of SES. We can see some support for the view that being in the stronger academic program also increases the log odds of being proficient in math (0.137, p < .07).

Adding School Predictors

We can see in the following table that when we add the two school-level predictors, only student composition is significantly related to the likelihood of being proficient (log odds = 0.541, p < .01).
Increasing student SES composition by 1-SD would increase the log odds of being proficient by 0.541 units.
Table 5: Fixed Effects
Model Term    Coefficient   Std. Error   t        Sig.   Exp(Coefficient)   95% CI Lower   95% CI Upper
Intercept     0.711         0.047        15.123   .000   2.036              1.856          2.232
ses           0.360         0.046        7.900    .000   1.434              1.311          1.568
gpa           0.431         0.031        13.918   .000   1.538              1.448          1.635
acprog=1      0.158         0.074        2.128    .033   1.171              1.013          1.355
acprog=0      0 a
acadfocus     0.061         0.256        0.238    .812   1.063              0.643          1.755
studentcomp   0.541         0.086        6.294    .000   1.717              1.451          2.033
Probability distribution: Binomial; Link function: Logit
a This coefficient is set to zero because it is redundant.

In terms of odds ratios, increasing composition by 1 SD (since it is also standardized) would increase the odds of being proficient by a factor of 1.717. We should note that because odds ratios are multiplicative rather than additive, if we increased student composition by 2 SD, the resulting odds of being proficient would increase by a factor of 2.95. To obtain this new estimated odds ratio, we can first add the log odds (which are the exponents) and then exponentiate the new log odds:

e^(0.541 + 0.541) = e^1.082 = 2.95

We can also obtain this result by multiplying the odds ratios (1.717 * 1.717 = 2.95). The odds ratios should not be added, however (1.717 + 1.717 = 3.434). So at 2 SD above the grand mean, the odds of being proficient would be increased by a factor of 2.95, or approximately 3 to 1. We could, of course, extend our analysis to examine a random slope such as the SES-proficiency slope, or perhaps the impact of being in the more advanced academic program might vary across schools. We can also extend the basic two-level model to three-level models. Similarly, we could extend this basic dichotomous cross-sectional model to represent a longitudinal model looking at the likelihood of being proficient at different points in time (e.g., from 9th through 12th grades).
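Two of the arithmetic steps in this handout, the intraclass correlation from the variance components table and the multiplicative behavior of odds ratios, can be double-checked with a few lines of Python (my check, using the handout's estimates of 0.386 and 0.541):

```python
import math

# ICC: between-school variance over total variance, with the level-1
# variance of a standard logistic distribution fixed at pi^2/3 (about 3.29)
var_between = 0.386
icc = var_between / (var_between + math.pi**2 / 3)
print(round(icc, 3))  # 0.105, i.e., about 10.5% of variance between schools

# Odds ratios multiply rather than add: a 2-SD increase in student
# composition (log odds of 0.541 per SD) multiplies the odds by exp(0.541)^2
beta = 0.541
print(round(math.exp(beta), 2))      # 1.72, the 1-SD odds ratio
print(round(math.exp(2 * beta), 2))  # 2.95, the 2-SD odds ratio
```

Working on the log odds scale (adding 0.541 + 0.541, then exponentiating) and working on the odds ratio scale (multiplying 1.717 by 1.717) give the same 2.95, as the text notes.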