MULTINOMIAL LOGISTIC REGRESSION

Christian Flowers
5 years ago

Variable Y is the dependent variable; variables X, Z, W are called regressors. Multinomial logistic regression is a generalization of binary logistic regression.

Model

Multinomial logistic regression is realized as a set of binary logistic regressions. One category is chosen as the reference category and, for each of the remaining categories, a binary logistic regression model is constructed. For example, let Y take four different values (4 categories), that is, Y = 1, 2, 3, 4. Assume that the reference category is Y = 4 and that X, Z, W are the regressors of the model. Then we construct three binary logistic regression models:

$$\frac{P(Y=1)}{P(Y=4)} = \exp\{C_1 + b_{11}X + b_{12}Z + b_{13}W\},$$

$$\frac{P(Y=2)}{P(Y=4)} = \exp\{C_2 + b_{21}X + b_{22}Z + b_{23}W\},$$

$$\frac{P(Y=3)}{P(Y=4)} = \exp\{C_3 + b_{31}X + b_{32}Z + b_{33}W\}.$$

The coefficients $C_1, b_{11}, b_{12}, b_{13}, C_2, b_{21}, b_{22}, b_{23}, C_3, b_{31}, b_{32}, b_{33}$ are unknown. Their estimates $\hat C_1, \hat b_{11}, \hat b_{12}, \hat b_{13}, \dots$ are obtained from data.

Assume that Z and W are fixed. Then:

If $b_{11} > 0$, then as X grows, Y = 1 becomes more probable relative to Y = 4.
If $b_{11} < 0$, then as X grows, Y = 4 becomes more probable relative to Y = 1.

Similarly, dividing the first model equation by the second gives $P(Y=1)/P(Y=2) = \exp\{(C_1 - C_2) + (b_{11} - b_{21})X + (b_{12} - b_{22})Z + (b_{13} - b_{23})W\}$. Hence:

If $b_{11} - b_{21} > 0$, then as X grows, Y = 1 becomes more probable relative to Y = 2.
If $b_{11} - b_{21} < 0$, then as X grows, Y = 2 becomes more probable relative to Y = 1.

Forecasting

For concrete values of X, Z, W we calculate

$$z_1 = C_1 + b_{11}X + b_{12}Z + b_{13}W,$$

$$z_2 = C_2 + b_{21}X + b_{22}Z + b_{23}W,$$

$$z_3 = C_3 + b_{31}X + b_{32}Z + b_{33}W$$

and

$$P(Y=1) = \frac{e^{z_1}}{1 + e^{z_1} + e^{z_2} + e^{z_3}}, \qquad P(Y=2) = \frac{e^{z_2}}{1 + e^{z_1} + e^{z_2} + e^{z_3}},$$

$$P(Y=3) = \frac{e^{z_3}}{1 + e^{z_1} + e^{z_2} + e^{z_3}}, \qquad P(Y=4) = \frac{1}{1 + e^{z_1} + e^{z_2} + e^{z_3}}.$$

Here $e \approx 2.718$. We forecast the value with the largest probability.

Odds ratio

For each category we can calculate the odds ratio, that is, by how much the odds change when a regressor grows by one point. The number $e^{b_{11}}$ is the odds ratio for P(Y=1)/P(Y=4) when X grows by one point (Z and W are fixed):

$$\left(\frac{P(Y=1)}{P(Y=4)}\right)_{new} = e^{b_{11}} \left(\frac{P(Y=1)}{P(Y=4)}\right)_{old}$$
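The forecasting and odds ratio formulas can be checked numerically. The following is a minimal Python sketch, assuming four categories with Y = 4 as the reference category. The coefficient values and the helper names category_probabilities and linear_predictors are hypothetical, chosen only for illustration; they do not come from any data set.

```python
import math

def category_probabilities(z):
    """Probabilities of the non-reference categories, followed by the reference one.

    z holds the linear predictors z_1, ..., z_{k-1}; the reference category
    implicitly has predictor 0, which contributes the "1" in the denominator.
    """
    exps = [math.exp(v) for v in z]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]

# Hypothetical coefficients (C_j, b_j1, b_j2, b_j3) for j = 1, 2, 3.
COEF = [(0.5, 0.8, -0.2, 0.1),
        (-1.0, 0.3, 0.4, 0.0),
        (0.2, -0.5, 0.1, 0.3)]

def linear_predictors(x, z, w):
    """Compute z_1, z_2, z_3 for given regressor values X, Z, W."""
    return [c + b1 * x + b2 * z + b3 * w for c, b1, b2, b3 in COEF]

# X grows by one point while Z and W stay fixed.
p_old = category_probabilities(linear_predictors(1.0, 2.0, 3.0))
p_new = category_probabilities(linear_predictors(2.0, 2.0, 3.0))

# Odds of Y = 1 against the reference category Y = 4, before and after.
odds_old = p_old[0] / p_old[3]
odds_new = p_new[0] / p_new[3]
ratio = odds_new / odds_old  # should equal exp(b_11) = exp(0.8)

# Forecast: the category (numbered 1..4) with the largest probability.
forecast = 1 + max(range(4), key=lambda j: p_new[j])
```

Under these assumptions, ratio agrees with math.exp(0.8) up to rounding error, whatever fixed values of Z and W are used.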

Data

The dependent variable Y is categorical; the regressors are interval or categorical variables.
Interval regressors should not be strongly correlated with each other.
For each category of Y we need a sufficient amount of data.

Main steps

1) Check the p-value of the maximum likelihood statistic. If p ≥ 0.05, the model is unacceptable.
2) Check that all regressors are statistically significant (all p < 0.05). Drop insignificant regressors from the model.
3) Check the classification table. If there are many false classifications, the model is unacceptable.
4) Additionally, one can check that the deviance test has p ≥ 0.05 (for models with few interval regressors).

For a good model:

The p-value of the maximum likelihood statistic is below 0.05.
The maximum likelihood p-value is below 0.05 for all regressors.
Wald's p < 0.05 for the majority of regressors in all sub models.
The percent of correctly classified observations in each category is larger than the percent of data belonging to that category.
Cook's distance ≤ 1 for all observations.
The chosen pseudo R square (coefficient of determination) ≥ 0.20.
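The classification criterion from the list above (percent correct in each category must exceed that category's share of the data) can be sketched in Python as follows. The function names and the counts in the example table are hypothetical; rows are observed categories and columns are predicted categories.

```python
def percent_correct_by_category(table):
    """table[i][j] = number of cases observed in category i and predicted as j."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(table)]

def marginal_percent(table):
    """Share of all cases that belong to each observed category."""
    total = sum(sum(row) for row in table)
    return [100.0 * sum(row) / total for row in table]

def classification_ok(table):
    """True if every category is classified better than its marginal share."""
    return all(pc > mp for pc, mp in
               zip(percent_correct_by_category(table), marginal_percent(table)))

# Hypothetical classification counts for three categories.
good_table = [[70, 20, 10],
              [15, 160, 25],
              [20, 25, 95]]

# Hypothetical two-category table in which the first category is classified poorly.
bad_table = [[10, 90],
             [50, 50]]

good_result = classification_ok(good_table)
bad_result = classification_ok(bad_table)
```

This is only one of the listed criteria; the p-values, Cook's distance and pseudo R square still have to be read from the statistical output.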

Multinomial logistic regression with SPSS

1. Data

File ESS4CZ_IL_SE, variables:

cntry: country (CZ Czech Republic, IL Israel, SE Sweden);
stfedu: satisfaction with the country's educational system, from 0 (very bad) to 10 (perfect);
imsclbn: when immigrants can use social benefits (1 at once, 2 after one year, 3 after one year of working and paying taxes, 4 after becoming citizens, 5 never);
imigrantbf: equals 0 for imsclbn <= 3 and 1 for imsclbn >= 4;
trstprl: trust in parliament (0 very low, ..., 9 very high);
pray: praying (1 every day, ..., 7 never);
hhmmb: number of the remaining members of the household.

We want to check whether the regressors can help to distinguish among the different countries. We investigate males 20 to 30 years of age. Use Select Cases with a condition that combines agea <= 30 & agea >= 20 with the gndr code for males.

2. SPSS options

Schematically our model is cntry = f (imigrantbf, pray, stfedu, trstprl, hhmmb).

Analyze -> Regression -> Multinomial Logistic.

Put cntry into Dependent. The interval variables pray, stfedu, trstprl, hhmmb go into Covariates. The categorical variable imigrantbf goes into Factor(s). Use Reference Category if a different reference category is wanted. In Statistics, additionally check Goodness-of-fit and Classification table.

3. Results

Table Case Processing Summary gives the number of respondents in each category. There should not be one dominant category.

Case Processing Summary (columns N and Marginal Percentage): rows cntry Country (CZ Czech Republic, IL Israel, SE Sweden), imigrantbf Social benefits for immigrants (.00 positive attitude, 1.00 negative attitude), Valid, Missing (N = 71), Total (N = 592), Subpopulation (461).

Table Case Processing Summary is also used for comparison with the classification table. If Czechs comprise 30.3% of all respondents, then the multinomial logistic procedure should correctly classify a larger percent of Czechs. The classification table is at the very end of the output:

Classification

Observed              Percent Correct
CZ Czech Republic     70.3%
IL Israel             80.1%
SE Sweden             67.6%
Overall Percentage    73.7%

Predicted shares (Overall Percentage row): CZ Czech Republic 30.9%, IL Israel 42.0%, SE Sweden 27.1%.

We see that our model correctly classifies 70.3% of Czechs, 80.1% of Israelis and 67.6% of Swedes.

Table Pseudo R-Square contains the pseudo R-Squares. We can choose any one of them and be happy if it is larger than 0.20. For example, Nagelkerke R-Square = 0.650.

Pseudo R-Square

Cox and Snell    .575
Nagelkerke       .650
McFadden         .397

The maximum likelihood statistic and its p-value for the whole model are given in Model Fitting Information. Since p < 0.05, we conclude that the overall model fit to the data is statistically significant.

Model Fitting Information: columns Model, -2 Log Likelihood (Model Fitting Criteria), Chi-Square, df, Sig. (Likelihood Ratio Tests); rows Intercept Only, Final.

Statistical significance of the regressors for the whole model can be found in the table Likelihood Ratio Tests. If p < 0.05, the regressor is statistically significant. If p >= 0.05, the regressor is not significant; it should be dropped from the model and the whole analysis repeated. In our case all regressors are statistically significant.

Likelihood Ratio Tests: columns Effect, -2 Log Likelihood of Reduced Model (Model Fitting Criteria), Chi-Square, df, Sig. (Likelihood Ratio Tests); rows Intercept, pray, stfedu, trstprl, hhmmb, imigrantbf.

Table Parameter Estimates gives information about all sub models. Observe that in some sub models various regressors are not significant. For example, pray is significant when distinguishing between Israel and Sweden and insignificant when distinguishing between Czech Republic and Sweden.

Parameter Estimates: columns B, Std. Error, Wald, df, Sig., Exp(B), 95% Confidence Interval for Exp(B) (Lower Bound, Upper Bound); for each of the blocks CZ Czech Republic and IL Israel the rows are Intercept, pray, stfedu, trstprl, hhmmb, [imigrantbf=.00], [imigrantbf=1.00].
a. The reference category is: SE Sweden.
b. The parameter for [imigrantbf=1.00] is set to zero because it is redundant.

Each sub model has its own model equation. For example,

$$\frac{P(cntry = CZ)}{P(cntry = SE)} = e^{z_1}.$$

Here

$$z_1 = \hat C_1 + \hat b_{11}\,pray + 0.299\,stfedu - 0.487\,trstprl + 0.216\,hhmmb + \begin{cases} 0, & \text{if } imigrantbf = 1, \\ -1.212, & \text{if } imigrantbf = 0, \end{cases}$$

where the intercept $\hat C_1$ and the pray coefficient $\hat b_{11}$ (0.011 in absolute value) are taken from column B of the CZ Czech Republic block of Parameter Estimates.

Positive coefficients of stfedu and hhmmb mean that a respondent who is more satisfied with the country's educational institutions and lives in a larger household is more likely to be from Czech Republic than from Sweden. Similarly, the negative coefficient of trstprl means that a respondent who trusts parliament more is more likely to be from Sweden. Finally, if a respondent has a more positive attitude toward social benefits for immigrants (imigrantbf = 0), it is more probable that he is from Sweden. Similarly,

$$\frac{P(cntry = IL)}{P(cntry = SE)} = e^{z_2}.$$

Here

$$z_2 = 5.077 - 0.573\,pray - 0.379\,stfedu - 0.25\,trstprl + 0.704\,hhmmb + d_2,$$

where $d_2 = 0$ if imigrantbf = 1 and, if imigrantbf = 0, $d_2$ is the [imigrantbf=.00] coefficient of the IL Israel sub model (0.821 in absolute value). The coefficient signs are interpreted in the same way as above.

4. Forecasting

Let pray = 2, stfedu = 4, trstprl = 5, hhmmb = 4, imigrantbf = 1. Then

$$z_1 = 0.149,$$

$$z_2 = 5.077 - 0.573 \cdot 2 - 0.379 \cdot 4 - 0.25 \cdot 5 + 0.704 \cdot 4 + 0 = 3.981,$$

and

$$e^{z_1} = 1.160, \qquad e^{z_2} = 53.570, \qquad e^{z_1} + e^{z_2} = 54.73.$$

Therefore

$$P(cntry = SE) = \frac{1}{1 + 54.73} = \frac{1}{55.73} = 0.01794,$$

$$P(cntry = CZ) = \frac{1.160}{55.73} = 0.0208,$$

$$P(cntry = IL) = \frac{53.570}{55.73} \approx 0.961.$$

Most likely this respondent is from Israel.

5. Interaction of variables

If we suspect an interaction of variables, we can add products of variables to the model. For example, if we think that cntry = f (imigrantbf, pray, stfedu, trstprl, hhmmb, stfedu*trstprl), then we additionally choose Model, check Custom/Stepwise, and put all variables into Forced Entry Terms. The product of stfedu and trstprl appears if we select both variables at once before moving them.

In the output we see that this interaction is not statistically significant.

Likelihood Ratio Tests: columns Effect, -2 Log Likelihood of Reduced Model (Model Fitting Criteria), Chi-Square, df, Sig. (Likelihood Ratio Tests); rows Intercept, hhmmb, imigrantbf, pray, stfedu, trstprl, stfedu * trstprl.
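The forecast in section 4 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the IL versus SE coefficients are those printed in the model equation for z 2, the value z 1 = 0.149 is taken directly from the text rather than recomputed, and the variable names are chosen here for illustration.

```python
import math

# Respondent from section 4; imigrantbf = 1 contributes 0 to both predictors.
pray, stfedu, trstprl, hhmmb = 2, 4, 5, 4

# Linear predictor of the IL vs SE sub model, coefficients as printed in the text.
z2 = 5.077 - 0.573 * pray - 0.379 * stfedu - 0.25 * trstprl + 0.704 * hhmmb

# Linear predictor of the CZ vs SE sub model, value as stated in the text.
z1 = 0.149

# Sweden is the reference category, so it contributes the 1 in the denominator.
denom = 1.0 + math.exp(z1) + math.exp(z2)
probabilities = {
    "CZ": math.exp(z1) / denom,
    "IL": math.exp(z2) / denom,
    "SE": 1.0 / denom,
}

# Forecast the country with the largest probability.
forecast = max(probabilities, key=probabilities.get)
```

With these inputs z2 evaluates to 3.981 and the forecast is "IL", in line with the hand calculation.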
