USING BINARY LOGISTIC REGRESSION AND PROBIT ANALYSIS TO MODEL TEACHER ESTIMATES OF TALENTED AND GIFTED STUDENTS' CHARACTERISTICS WITH THE IDENTIFICATION RESULTS OF A MENTAL ABILITY TEST: A COMPARATIVE STUDY

ADEL AHMED BABTAIN
Department of General Studies, Yanbu University College, Saudi Arabia
adel.babtain@yuc.edu.sa

The study aimed to compare binary logistic regression with binary probit analysis in terms of goodness of fit, measures of practical significance, predictive ability, and coefficient interpretation. The study relied on a correlational methodology and a population of 341 fifth-grade male students in the Jeddah City educational area who were nominated for gifted programs. The sample comprised all 292 valid cases (86% of the population). The study used two tools to collect data: Renzulli's Scale for Rating Behavioral Characteristics of Superior Students (SRBCSS) and the Saudi National Test of Mental Abilities (SNTMA). The findings showed that binary logistic regression and binary probit analysis display identical results, that the practical significance measures of both models are also identical, and that $R_L^2$ could be the best measure of practical significance because of its similarity to the $R^2$ used in ordinary least squares (OLS) regression analysis. The study showed that the coefficient interpretations of the two models differ slightly. Binary logistic regression allowed more meaningful interpretation of coefficients than binary probit analysis. The study also found that only the creativity characteristic is statistically significant in explaining and predicting the identification of gifted students, while the other characteristics (learning, motivation and leadership) are statistically insignificant. The study recommended that more comparative investigations of logistic regression and probit analysis be carried out, especially with more widely spread distributions. It also recommended setting aside the full scale of teacher estimates of the characteristics of talented and gifted students and relying exclusively on the creativity characteristic scale instead.

1 Introduction

Linear regression models provide a popular device for organizing data analysis in which researchers focus on the explanation of a dependent variable, $Y$, as a function of multiple independent variables, from $X_1$ to $X_k$. However, when linear regression is applied to a dichotomous (binary) dependent variable, the linearity assumption is violated, and a mathematical transformation must be applied to linearize the relationship between the dependent and independent variables (Menard, 2002, p.5). Although the two values of a dichotomous dependent variable can be coded with any numbers, using the values 1 and 0 has advantages. In that case, the mean of the dummy variable equals the proportion of cases with the value 1 and can be interpreted as the probability of having the value 1 (a specific event or characteristic) (Wright, 1996; Wolfe, 2002; Poston, 2004).

2 Background

2.1 Logit and probit transformation

Although many nonlinear functions can represent the S-shaped curve, the logistic (logit) and probit transformations have become popular (Pampel, 2000, p.10; Walker, 1998; Aldrich & Nelson, 1984). Given that the dummy dependent variable represents the probability $P_i$ of an event (with the dependent variable equaling 1), the logit transformation involves two steps (Guido, Winters & Rains, 2006). First, take the ratio of $P_i$ to $1 - P_i$, that is, the odds of experiencing the event. Then, take the natural logarithm of the odds. The logit thus equals:

$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) \qquad (1)$$

or, in short, the logged odds. In this way the logit transformation straightens out the nonlinear relationship between $X$ and the original probabilities of $Y$ (Pampel, 2000, p.10).

Probit analysis transforms the probabilities of an event into scores from the cumulative standard normal distribution rather than into logged odds from the logistic distribution (Pampel, 2000, p.54). A standard normal curve table matches Z scores (theoretically ranging from negative infinity to positive infinity, but in practice from -3 to 3) with the proportion of the area under the curve between the absolute value of the Z score and the mean Z score of 0.
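The logit transformation in equation (1) and the probit transformation introduced above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the example probabilities are assumptions chosen for illustration and do not come from the study.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def logit(p: float) -> float:
    """Logged odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def probit(p: float) -> float:
    """Z score whose cumulative standard normal probability equals p."""
    return STD_NORMAL.inv_cdf(p)


def inverse_logit(x: float) -> float:
    """Map logged odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def inverse_probit(z: float) -> float:
    """Map a Z score back to a probability (cumulative standard normal)."""
    return STD_NORMAL.cdf(z)


# Hypothetical probabilities, chosen only to show the shape of each transformation.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p={p:.2f}  logit={logit(p):+.3f}  probit={probit(p):+.3f}")
```

Both transformations map a probability of 0.5 to 0 and stretch probabilities near 0 or 1 toward negative or positive infinity; they differ mainly in the scale of the resulting units.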
With simple calculations, the standard normal table also identifies the proportion of the area from negative infinity to a given Z score. The proportion of the curve at or below each Z score defines the cumulative standard normal distribution. Since that proportion equals the probability of falling at or below the Z score, larger Z scores define greater probabilities in the cumulative standard normal distribution (Pampel, 2000, pp.54-55). Conversely, just as any Z score defines a probability in the cumulative standard normal distribution, any probability in the cumulative standard normal distribution translates into a Z score. In sum, the cumulative standard normal curve resembles the logistic curve, only with Z scores instead of logged odds along the horizontal axis (Pampel, 2000, p.56).

2.2 Practical significance measures

Although the dependent variable in logistic regression and probit analysis does not have variance in the same way that continuous variables do in OLS regression, maximum likelihood procedures provide model fit measures analogous to those from least squares regression. In logistic regression and probit analysis, the baseline log likelihood ($\log L_0$) times -2 represents the likelihood of producing the observed data with the parameters for the independent variables equal to zero, and corresponds to the total sum of squares. The model log likelihood ($\log L_M$) times -2 represents the likelihood of producing the observed data with the estimated parameters for the independent variables, and corresponds to the error sum of squares in OLS regression. The improvement of the model log likelihood relative to the baseline shows the improvement due to the independent variables. Accordingly, these two log likelihoods define an analogue of the proportional reduction in error measure in regression:

$$R_L^2 = \frac{(-2\log L_0) - (-2\log L_M)}{-2\log L_0} \qquad (2)$$

The numerator shows the reduction in model error due to the independent variables, and the denominator shows the error without the independent variables. The resulting value shows the improvement in the log likelihood relative to the baseline. It equals 0 when all the coefficients equal 0, and has a maximum that comes close to 1 when the independent variables completely determine and explain the dependent variable. However, the measure does not represent explained variance, since log likelihood functions do not deal with variance defined as the sum of squared deviations.
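As a rough illustration of equation (2), the sketch below computes the proportional reduction in -2 log likelihood for a toy data set. The outcomes and fitted probabilities are hypothetical values invented for this example (they are not estimates from the study, nor the output of a fitted model), and the helper names are assumptions.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Bernoulli log likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def pseudo_r2(ll_null, ll_model):
    """Equation (2): proportional reduction in -2 log likelihood."""
    deviance_null = -2.0 * ll_null    # analogue of the total sum of squares
    deviance_model = -2.0 * ll_model  # analogue of the error sum of squares
    return (deviance_null - deviance_model) / deviance_null


# Hypothetical outcomes (1 = gifted, 0 = not gifted).
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

# Baseline model: every case receives the overall proportion of 1s.
base_rate = sum(outcomes) / len(outcomes)
ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))

# Hypothetical fitted probabilities from some model with predictors.
fitted = [0.8, 0.3, 0.7, 0.6, 0.4, 0.2, 0.9, 0.3]
ll_model = log_likelihood(outcomes, fitted)

print(f"pseudo R^2 = {pseudo_r2(ll_null, ll_model):.3f}")
```

When the fitted probabilities are no better than the base rate, the measure is 0; it approaches 1 only as the fitted probabilities approach the observed outcomes.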
This and similar measures are therefore referred to as the pseudo-variance explained or pseudo $R^2$ (Pampel, 2000, p.49).

2.3 Coefficients Interpretation

In linearizing the nonlinear relationships, logistic regression shifts the interpretation of coefficients from changes in probabilities to less intuitive changes in logged odds (Dallal, 2001). The loss of interpretability of the logistic coefficients, however, is balanced by a gain in parsimony: the linear relationship with the logged odds can be summarized with a single coefficient, whereas the nonlinear relationship with the probabilities cannot be summarized so simply (Pampel, 2000, p.18; Cizek & Fitzgerald, 1999).

On the other hand, probit coefficients show the linear and additive change in Z-score units of the probit transformation (i.e., the inverse of the cumulative standard normal distribution) for a one-unit change in the independent variables (Liao, 1994, p.21). Perhaps even less intuitive than logged odds, standard units of the cumulative normal distribution have little interpretive value (Pampel, 2000, p.60).

3 Study purpose

The purpose of this study is to answer the following research questions:

1. To what extent do binary logistic regression and binary probit analysis fit behavioral characteristics with giftedness identification?
2. What are the measures of practical significance in both techniques?
3. Are there differences in the abilities of the independent variables in logistic regression and probit analysis?
4. How are logistic regression and probit analysis coefficients interpreted?

4 Method

4.1 Population and sample

The study population was the fifth-grade male students in the Jeddah City educational area who were nominated for gifted programs (341 students), and the sample comprised all 292 valid cases (86% of the population).

4.2 Instruments

The study relied on two tools: Renzulli's Scale for Rating Behavioral Characteristics of Superior Students (SRBCSS) and the Saudi National Test of Mental Abilities (SNTMA). The SRBCSS version used in this study consists of four dimensions (leadership, creativity, motivation and learning) and was standardized for Saudi Arabia and Bahrain by Clinton (1988) and Maajeni et al. (1995). The researcher verified the validity and reliability of the tool by analyzing 50 cases of the sample, with the following findings:

Table 1. Researcher's verification of SRBCSS reliability

Dimension | Number of items | Cronbach's α
Learning | [...] | [...]
Motivation | [...] | [...]
Creativity | [...] | [...]
Leadership | [...] | [...]

The table shows that all dimensions of the scale have reasonable reliability coefficients. The researcher also tested internal consistency by calculating the correlation between each dimension and the total score on the SRBCSS.

Table 2. Internal consistency coefficients for the scale dimensions

Columns, repeated for each dimension (Creativity, Leadership, Motivation, Learning): Item | Correlation coefficient with the total**
[...]
** All correlation coefficients are statistically significant at the 0.01 level.

The findings show that all scale dimensions have high internal consistency coefficients. Together, the findings in Table 1 and Table 2 indicate that the SRBCSS has an acceptable level of reliability.

The second instrument used in the study was the SNTMA, developed by Al-Share et al. (2001). The test includes 81 items covering four abilities (dimensions): verbal (24 items), numerical (26 items), spatial (19 items), and reasoning (18 items). The test was approved by the Saudi Ministry of Education as a formal test to identify students' giftedness across the country. The test shows the following psychometric measures (Al-Share et al., 2001, p.25):

1. The α coefficients for the four dimensions ranged from 0.77 to 0.88, and the α coefficient for the entire test was [...].

2. The correlation coefficients of each dimension with the achievement scores ranged from 0.21 to 0.43, and the coefficient for the entire test was [...].
3. The correlation coefficients of each dimension with the adapted Wechsler test (the Saudi version) were as follows: verbal ability (0.75), numerical ability (0.63), reasoning ability (0.53), and spatial ability (0.57). The correlations with the Wechsler test (application part) were 0.59, 0.48, 0.47, and 0.55 respectively, and the correlation of the entire test with the Wechsler test was [...].
4. The construct validity was verified by testing the statistical significance of the differences in SNTMA means across age categories (from 9 to 16 years). The tests show statistically significant differences among SNTMA means by age, with means increasing as age increases. Moreover, the loadings of all dimensions on the verbal and nonverbal (performance) sections of the Wechsler test were significant, which means that both tests measure one factor, general intellectual ability (Al-Share et al., 2001, p.30).

These findings show that the SNTMA has very good psychometric features and is valid for identifying Saudi gifted students.

5 Results

5.1 The first research question

To what extent do binary logistic regression and binary probit analysis fit behavioral characteristics with giftedness identification?

To answer this question, the researcher modeled the data using both logistic regression and probit analysis, tested the null hypothesis that all behavioral characteristic coefficients equal zero, and found the following:

Table 3. Significance levels for logistic regression and probit analysis models

Model* | df | Significance
Logistic regression | [...] | [...]
Probit analysis | [...] | [...]

* Each model includes the creativity, leadership, motivation and learning characteristics as independent variables and the category of students (gifted or ungifted) as the dependent variable.

The table shows that, for the logistic regression, the probability of obtaining a χ² statistic equal to [...] is 0.06, given that the null hypothesis is true (the independent variables have no effect on the dependent variable). So, the overall logistic regression model is statistically significant because the p-value is less than [...].

Also, the table shows that the probability of obtaining a χ² statistic equal to [...] in the probit analysis is [...], given that the null hypothesis is true (the independent variables have no effect on the dependent variable). This means that the overall probit model is statistically significant because the p-value is less than [...]. In sum, both the logistic regression and the probit models, which include the creativity, leadership, motivation and learning characteristics as independent variables and the category of students (gifted or ungifted) as the dependent variable, are statistically significant: at least one of the independent variable coefficients is not equal to zero and contributes to identifying gifted students.

5.2 The second research question

What are the measures of practical significance in both techniques?

To answer this question, three practical significance measures were generated, as shown in the table below:

Table 4. Measures of practical significance of logistic regression and probit analysis

Pseudo $R^2$ | Logistic regression | Probit analysis
McFadden ($R_L^2$) | [...] | [...]
Cox and Snell ($R_M^2$) | [...] | [...]
Nagelkerke ($R_N^2$) | [...] | [...]

McFadden $R^2$, Cox and Snell $R^2$ and Nagelkerke $R^2$ are attempts to provide a logistic and probit analogue of $R^2$ in ordinary least squares regression. The findings show that the pseudo $R^2$ measures range from 3.6% to 6.5% for logistic regression and from 3.5% to 6.3% for probit analysis. These percentages reflect the reductions in the log likelihood of the models due to including the leadership, motivation, creativity and learning characteristics.

5.3 The third research question

Are there differences in the abilities of the independent variables in logistic regression and probit analysis?

The following table shows the independent variable coefficients and the corresponding Wald statistics for logistic regression and probit analysis.

Table 5. Wald tests for model coefficients

Model | Independent variable | Coefficient b | Standard error | Wald statistic | df | Sig. | 95% confidence interval, lower boundary | 95% confidence interval, upper boundary
Logistic regression | creativity | [...]
Logistic regression | leadership | [...]
Logistic regression | motivation | [...]
Logistic regression | learning | [...]
Logistic regression | constant | [...]
Probit analysis | creativity | [...]
Probit analysis | leadership | [...]
Probit analysis | motivation | [...]
Probit analysis | learning | [...]
Probit analysis | constant | [...]

For logistic regression, the Wald statistics show that only the creativity characteristic was statistically significant at the [...] significance level. This means that only creativity has a significant impact on identifying gifted students. On the other hand, the probit analysis shows that none of the independent variables (creativity, leadership, motivation and learning) is statistically significant, including creativity, which is statistically significant in the logistic regression model.

5.4 The fourth research question

How are logistic regression and probit analysis coefficients interpreted?

To answer this question, the following ways of interpretation are used.

The easiest way to interpret logistic regression coefficients is through the logit. The creativity logit coefficient equals [...], which means that, controlling for the effects of the other independent variables, the log odds of being gifted increase by [...] for each one-unit increase in the creativity characteristic on the SRBCSS.

Coefficients can also be interpreted in terms of odds rather than log odds. The creativity logit coefficient can be transformed into an odds coefficient by exponentiating it; the odds coefficient equals $e^b$, which for the creativity variable equals 1.082. This means that for each one-unit increase in the creativity characteristic of the SRBCSS, the odds of being a gifted student are multiplied by a factor of 1.082. Equivalently, a one-unit increase in the creativity characteristic increases the odds of being a gifted student by 8.2%. In general, for any independent variable $X$ with odds $O_X$, if $X$ increases by one unit to $X+1$, the odds $O_{X+1}$ equal $O_X$ times $e^b$. So, the ratio of $O_{X+1}$ to $O_X$ is:

$$\frac{O_{X+1}}{O_X} = \frac{O_X\, e^{b}}{O_X} = e^{b} \qquad (3)$$

Thus the $e^b$ coefficient, called the odds coefficient, is also known as the odds ratio and abbreviated as OR.

Finally, since the mathematical relationship between odds and probabilities is well defined, researchers can convert odds into probabilities with a simple calculation. In this way, effects on logged odds or odds can be translated into effects on probabilities. However, since the relationship between the independent variables and the probabilities is nonlinear, these effects on probabilities cannot be represented by a single coefficient value and have to be determined at a particular value of the independent variable.

Probit coefficients can also be interpreted in different ways. The easiest is direct interpretation through the probit transformation. The probit formula is:

$$\Phi^{-1}(P_i) = a + bX_i \qquad (4)$$

where $\Phi^{-1}$ is the inverse of the cumulative standard normal distribution. The creativity probit coefficient in Table 5 equals 0.047, which means that for a one-unit increase in the creativity characteristic, being identified as a gifted student increases by 0.047 probit units, that is, units of the inverse cumulative standard normal function. Also, since the relationship between probits and probabilities is well defined through the cumulative standard normal distribution, probit coefficients can be interpreted as effects of the independent variables on probabilities.
However, as with logistic regression, no single coefficient value summarizes the effects on probabilities at all levels of the independent variables, and researchers should interpret the effects of changes in the independent variables on the probabilities of the dependent variable very carefully and only at specified levels of the independent variables.
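The interpretations in this section can be sketched numerically. The example below takes the odds ratio of 1.082 and the probit coefficient of 0.047 reported above; the logit coefficient is back-computed from the odds ratio as an assumption (the original value is not available here), and the baseline probabilities and function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def logit_probability_change(p0: float, b: float) -> float:
    """Change in probability for a one-unit increase in X under the logit
    model, starting from baseline probability p0."""
    new_odds = (p0 / (1.0 - p0)) * math.exp(b)  # odds are multiplied by e^b
    return new_odds / (1.0 + new_odds) - p0


def probit_probability_change(p0: float, b: float) -> float:
    """Change in probability for a one-unit increase in X under the probit
    model, starting from baseline probability p0."""
    z0 = STD_NORMAL.inv_cdf(p0)  # baseline Z score
    return STD_NORMAL.cdf(z0 + b) - p0


odds_ratio = 1.082               # creativity odds ratio reported in the text
b_logit = math.log(odds_ratio)   # assumption: back-computed, not the reported b
b_probit = 0.047                 # creativity probit coefficient from the text

percent_change_in_odds = (odds_ratio - 1.0) * 100.0
print(f"odds change per unit: {percent_change_in_odds:.1f}%")

# Hypothetical baseline probabilities of being identified as gifted.
for p0 in (0.1, 0.5, 0.9):
    d_logit = logit_probability_change(p0, b_logit)
    d_probit = probit_probability_change(p0, b_probit)
    print(f"p0={p0:.1f}  logit change={d_logit:.4f}  probit change={d_probit:.4f}")
```

In both models the same one-unit increase changes the probability most near a baseline of 0.5 and least near 0 or 1, which is why the effect on probabilities has to be reported at a stated level of the independent variables.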

6 Discussion

Both the logistic regression and the probit models fit the data relating the rated behavioral characteristics of superior students to giftedness identification, and even though their linearizing transformations differ, they gave nearly equivalent results. The measures used in both analyses to assess practical significance also showed nearly identical results. Both models built in this study displayed a low level of practical significance even though the overall models were statistically significant.

Logistic regression and probit analysis provide several measures of practical significance similar to $R^2$ in OLS regression, but none of them is identical in meaning to the OLS measure. The $R^2$ measures in logistic regression and probit analysis do not in any way represent the proportion of dependent variable variance explained by the independent variables of the model, which is why they are known as pseudo $R^2$.

The logistic regression model showed that only the creativity characteristic has statistical significance and influence on the identification of giftedness. The probit analysis, on the other hand, did not show statistical significance for any independent variable in explaining and predicting the identification of gifted students. This finding might highlight a slight difference between logistic regression and probit analysis, even though most statisticians believe that the two analyses give essentially identical results, to the extent that the choice between them is considered a matter of individual preference (Pampel, 2000, p.54).

Pampel (2000, p.66) noted that, because the differences in coefficients result partly from the different variances of the transformed dependent variables, most Z scores in probit analysis are slightly larger than those in logistic regression. The creativity characteristic was therefore expected to reach significance more easily in the probit analysis than in the logistic regression. However, that expectation was not borne out in this study, which is why researchers should be very careful when significance values are very close to the critical values. The researcher agrees with Pampel (2000, p.60) that logistic regression and probit coefficients vary slightly because of the small differences between the logistic and normal curves, and that the two analyses produce similar substantive results.

The interpretation of the logistic regression coefficients showed that the linear relationship between the independent variables and the logged odds helps to summarize the impact of each independent variable on the dependent variable with a single coefficient value. However, this linearizing transformation shifts the interpretation of coefficients from changes in probabilities to less intuitive changes in logged odds.
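The earlier point that coefficient differences stem partly from the different variances of the two transformed variables can be illustrated numerically. The standard logistic distribution has variance π²/3 (standard deviation about 1.81), while the standard normal distribution has variance 1, so the same probability corresponds to a larger logit than probit. The sketch below is only an illustration of that scale difference; the probabilities are hypothetical and the variable names are assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1

# Standard deviation of the standard logistic distribution: pi / sqrt(3).
logistic_sd = math.pi / math.sqrt(3.0)
print(f"standard logistic sd = {logistic_sd:.3f}")

# Ratio of the logit to the probit of the same (hypothetical) probability.
ratios = {}
for p in (0.6, 0.75, 0.9, 0.99):
    logit_value = math.log(p / (1.0 - p))
    probit_value = STD_NORMAL.inv_cdf(p)
    ratios[p] = logit_value / probit_value
    print(f"p={p:.2f}  logit/probit = {ratios[p]:.2f}")
```

For these probabilities the ratio of logit to probit units stays between roughly 1.6 and 2.0, the same order of magnitude as the difference in coefficient size between the two models.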

Probit coefficients show the linear and additive change in Z-score units of the probit transformation, but they are perhaps even less intuitive than logged odds because standard units of the cumulative standard normal distribution have little interpretive value. Further, probit analysis does not allow the calculation of an equivalent of odds ratios and makes the calculation of changes in probabilities more difficult than in logistic regression (Pampel, 2000). That is why Pampel (2000, p.68) said that in most circumstances researchers will prefer logistic regression. He also said that, given the usefulness of multiplicative odds coefficients in logistic regression, the lack of comparable coefficients in probit analysis may contribute to the greater popularity of logistic regression (p.61).

Finally, probit and logistic regression results show that logistic regression coefficients exceed the corresponding probit coefficients by a factor varying from 1.5 to 2.2 (Pampel, 2000, p.66). Part of the difference in coefficients results from the different variances of the transformed dependent variables. Most Z scores for the probit analysis are slightly larger than those for the logistic regression (or, more precisely, than the square root of the Wald statistic). So, the Z score of some independent variables might reach the 0.05 level of significance in the probit analysis while it does not in the logistic regression (Pampel, 2000, p.67).

7 Conclusion

Logistic regression and probit analysis use different techniques to transform the nonlinear relationship of the dependent variable into a linear relationship with the independent variables. These transformations yield quite similar but not identical results in the two analyses. The main advantage of logistic regression compared with probit analysis is the variety of ways in which its coefficients can be interpreted, which are more intuitive.

The study also showed that only the creativity characteristic of the SRBCSS is statistically significant in identifying students' giftedness, which is why this study recommends using only the creativity part of the SRBCSS, instead of the full scale, to identify gifted students.

Finally, this study recommends carrying out more comparative investigations of logistic regression and probit analysis, especially with different types of dependent variable distributions, to uncover any possible systematic differences between the two analyses, especially in their level of conservatism.

References

1. Aldrich, John H. and Nelson, Forrest D. (1984). Linear Probability, Logit, and Probit Models. Sage University Paper series on Quantitative Applications in the Social Sciences. No. [...]. Beverly Hills, CA: Sage.

2. Al-Share et al. (1999). Identifying and caring of gifted students (In Arabic). Riyadh: King Abdul-Aziz City for Science and Technology.
3. Al-Share et al. (2001). Identifying and detecting gifted students (In Arabic). Riyadh: King Abdul-Aziz City for Science and Technology.
4. Borooah, Vani K. (2002). Logit and Probit: Ordered and Multinomial Models. Sage University Paper series on Quantitative Applications in the Social Sciences. No. [...]. Beverly Hills, CA: Sage.
5. Breslow, Norman and Holubkov, Richard (1997). Maximum Likelihood Estimation of Logistic Regression Parameters under Two-phase, Outcome-dependent Sampling. Royal Statistical Society. Vol.59, No.2, pp. [...].
6. Cizek, Gregory J. & Fitzgerald, Shawn M. (1999). Methods, Plainly Speaking: An Introduction to Logistic Regression. Measurement & Evaluation in Counseling and Development. Vol.31, January, [...].
7. Clinton, Abdul-Rahman Noor-Addin (1998). Scale for Rating Behavioral Characteristics of Superior Students (SRBCSS). Unpublished paper (In Arabic).
8. Dallal, Gerard E. (2001). Logistic Regression. Available at: [...]
9. Draper, N. R. & Smith, H. (1981). Applied Regression Analysis. 2nd edition. New York: John Wiley & Sons.
10. Eliason, Scott R. (1993). Maximum Likelihood Estimation: Logic and Practice. Sage University Paper series on Quantitative Applications in the Social Sciences. No. [...]. Beverly Hills, CA: Sage.
11. Fraas, John W. and Newman, Isadore (2003). Ordinary Least Squares Regression, Discriminant Analysis, and Logistic Regression: Questions Researchers and Practitioners Should Address When Selecting an Analytic Technique. Paper presented at the Annual Meeting of the Eastern Educational Research Association (Hilton Head Island, GA, February 26-March 1, 2003).
12. Fraas, John W.; Drushal, J. Michael; Graham, Jeff (2002). Expressing Logistic Regression Coefficients as Change in Initial Probability Values: Useful Information for Practitioners. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association (Columbus, Ohio, October 16-19, 2002).
13. Gebotys, Robert (2000). Examples: Binary Logistic Regression. January, [...].
14. Guido, Joseph J., Winters, Paul C. & Rains, Adam B. (2006). Logistic Regression Basics. University of Rochester Medical Center, Rochester, NY. Available at: [...]
15. Hanneman, Robert (n.d.). Multivariate Analysis. Department of Sociology. University of California.
16. Horton, Nicholas J. and Laird, Nan M. (2001). Maximum Likelihood Analysis of Logistic Regression Models with Incomplete Covariate Data and Auxiliary Information. Biometrics. Vol.57, pp.34-42, March [...].
17. Hosmer, David W. & Lemeshow, Stanley (2000). Applied Logistic Regression. 2nd edition. New York: John Wiley & Sons, Inc.
18. Houston, Walter M. & Woodruff, David J. (1997). Empirical Bayes Estimates of Parameters from the Logistic Regression Model. ACT Research Report Series [...]. American Coll. Testing Program, Iowa City, IA.
19. Johnson, Wesley & Watnik, Mitchell (2002). Interpretation of Regression Output: Diagnostics, Graphs & the Bottom Line. University of California, USA.
20. Kerlinger, Fred N. & Pedhazur, Elazar (1973). Multiple Regression in Behavioral Research. New York: Holt, Rinehart and Winston, Inc.
21. Kerlinger, Fred N. (1973). Foundations of Behavioral Research. 2nd edition. New York: Holt, Rinehart and Winston, Inc.
22. King, Gary and Zeng, Langche (2001). Logistic Regression in Rare Events Data. Society for Political Methodology. February 16, [...].
23. King, Jason E. (2002). Logistic Regression: Going beyond Point-and-Click. Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA, April 1-5, 2002).
24. King, Jason E. (2003). Running A Best-Subsets Logistic Regression: An Alternative to Stepwise Methods. Educational and Psychological Measurement. Vol.63, No.3, June 2003, [...].
25. Kleinbaum, David & Klein, Mitchel (2002). Logistic Regression: A Self-learning Text. USA: Springer.
26. Larson, Ray R. (2002). A Logistic Regression Approach to Distributed IR. University of California, Berkeley. School of Information Management and Systems. SIGIR 02, Tampere, Finland, August 11-15, [...].
27. Lea, Stephen (1997). Multivariate Analysis II: Manifest Variables Analysis. Topic 4: Logistic Regression and Discriminant Analysis. University of Exeter, Department of Psychology. Revised 11th March, [...]. Available at: [...]
28. Liao, Tim Futing (1994). Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models. Sage University Paper series on Quantitative Applications in the Social Sciences. No. [...]. Beverly Hills, CA: Sage.
29. Maajeni, Osama Hasan & Howidi, Mohammad Abdulraziq (1995). Differences between Superior and Normal Students on Scale for Rating Behavioral Characteristics of Superior Students in the Primary Schools of Bahrain (In Arabic). Kuwait: Education Journal, No. 35, Vol. 9; [...].
30. Menard, Scott (2002). Applied Logistic Regression Analysis. 2nd edition. Sage University Paper series on Quantitative Applications in the Social Sciences. No. [...]. Beverly Hills, CA: Sage.
