How likely is Simpson s paradox in path models?

Similar documents
Simpson s paradox, moderation, and the emergence of quadratic relationships in path models: An information systems illustration

Statistical power with respect to true sample and true population paths: A PLS-based SEM illustration

Using Estimating Equations for Spatially Correlated A

Dynamic Regression Models (Lect 15)

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Integration

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

COMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS. Abstract

Time Series: Theory and Methods

A Bootstrap Test for Conditional Symmetry

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Econometric Analysis of Cross Section and Panel Data

Markov Chain Monte Carlo methods

Estimation of the Conditional Variance in Paired Experiments

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

An Introduction to Mplus and Path Analysis

Modelling Academic Risks of Students in a Polytechnic System With the Use of Discriminant Analysis

OPTIMAL DESIGN INPUTS FOR EXPERIMENTAL CHAPTER 17. Organization of chapter in ISSO. Background. Linear models

A general mixed model approach for spatio-temporal regression data

Data analysis strategies for high dimensional social science data M3 Conference May 2013

Bootstrapping the Grainger Causality Test With Integrated Data

Model Estimation Example

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Key Algebraic Results in Linear Regression

Stat 5101 Lecture Notes

Multiple regression and inference in ecology and conservation biology: further comments on identifying important predictor variables

An Introduction to Path Analysis

Comparing latent inequality with ordinal health data

Analysis Methods for Supersaturated Design: Some Comparisons

Bayesian Analysis of Multivariate Normal Models when Dimensions are Absent

Commentary on Basu (1956)

Interpretable Latent Variable Models

Experimental Design and Data Analysis for Biologists

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models

Financial Econometrics

Harvard University. Rigorous Research in Engineering Education

Testing Statistical Hypotheses

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Hakone Seminar Recent Developments in Statistics

Bayesian Inference. Chapter 1. Introduction and basic concepts

Modified Variance Ratio Test for Autocorrelation in the Presence of Heteroskedasticity

Bayes methods for categorical data. April 25, 2017

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A Signed-Rank Test Based on the Score Function

A comparison study of the nonparametric tests based on the empirical distributions

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Extreme Value Analysis and Spatial Extremes

A Test of Cointegration Rank Based Title Component Analysis.

Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

26:010:557 / 26:620:557 Social Science Research Methods

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Logistic Regression: Regression with a Binary Dependent Variable

Regression, Ridge Regression, Lasso

MONTE CARLO ANALYSIS OF CHANGE POINT ESTIMATORS

Likelihood-Based Methods

Parametric Inference on Strong Dependence

INTRODUCTION TO INTERSECTION-UNION TESTS

Verifying Regularity Conditions for Logit-Normal GLMM

Generalized Linear Models (GLZ)

A proof of Bell s inequality in quantum mechanics using causal interactions

October 1, Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study

Forecasting 1 to h steps ahead using partial least squares

Controlling for latent confounding by confirmatory factor analysis (CFA) Blinded Blinded

Quantile regression and heteroskedasticity

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

Problems with Penalised Maximum Likelihood and Jeffrey s Priors to Account For Separation in Large Datasets with Rare Events

A Note on Auxiliary Particle Filters

Chapter 8: Regression Models with Qualitative Predictors

Monte Carlo Integration I [RC] Chapter 3

ECON 4160, Lecture 11 and 12

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

Does k-th Moment Exist?

Testing Statistical Hypotheses

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Better Bootstrap Confidence Intervals

The Tetrad Criterion

Simultaneous Equations and Weak Instruments under Conditionally Heteroscedastic Disturbances

INTERVAL ESTIMATION AND HYPOTHESES TESTING

ABHELSINKI UNIVERSITY OF TECHNOLOGY

Dynamic Discrete Choice Structural Models in Empirical IO

Collaborative topic models: motivations cont

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon

November 2002 STA Random Effects Selection in Linear Mixed Models

Research Design - - Topic 15a Introduction to Multivariate Analyses 2009 R.C. Gardner, Ph.D.

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data

Truncated Regression Model and Nonparametric Estimation for Gifted and Talented Education Program

Package noncompliance

Approximate and Fiducial Confidence Intervals for the Difference Between Two Binomial Proportions

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Long-Run Covariability

Appendix: Modeling Approach

Bootstrap tests of multiple inequality restrictions on variance ratios

Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION

Supplemental material to accompany Preacher and Hayes (2008)

Joint Estimation of Risk Preferences and Technology: Further Discussion

Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection

Transcription:

How likely is Simpson s paradox in path models? Ned Kock Full reference: Kock, N. (2015). How likely is Simpson s paradox in path models? International Journal of e- Collaboration, 11(1), 1-7. Abstract Simpson s paradox is a phenomenon arising from multivariate statistical analyses that often leads to paradoxical conclusions; in the field of e-collaboration as well as many other fields where multivariate methods are employed. We derive a general inequality for the occurrence of Simpson s paradox in path models with or without latent variables. The inequality is then used to estimate the probability that Simpson s paradox would occur at random in path models with two predictors and one criterion variable. This probability is found to be approximately 12.8 percent; slightly higher than 1 occurrence per 8 path models. This estimate suggests that Simpson s paradox is likely to occur in empirical studies, in the field of e-collaboration and other fields, frequently enough to be a source of concern. Keywords: E-collaboration; Simpson s paradox; path analysis; numeric computation; Monte Carlo simulation. 1

Introduction Simpson s paradox, also known as the Yule Simpson effect and the reversal paradox, is a phenomenon arising from multivariate statistical analyses (Pearl, 2009; Wagner, 1982). It is called a paradox because it often leads to paradoxical conclusions. Such conclusions may lead to the development of theories that incorporate causal effects disconnected from reality, based on empirical findings distorted by Simpson s paradox. This applies to the field of e-collaboration as well as many other fields where multivariate methods are employed. In the following sections, we provide an illustration of Simpson s paradox in a path model, followed by a discussion of past estimates of the likelihood of Simpson s paradox in contingency tables. We then provide a mathematical definition of Simpson s paradox in path models, in the form of a basic inequality, which we use as a basis for the development of a more general Simpson s paradox inequality that can be used in numeric estimations of the paradox s probability. Finally, we illustrate the use of this general inequality to estimate the probability that Simpson s paradox would occur at random in three-variable path models, which is found to be 12.8 percent; slightly higher than 1 8. For simplicity, and without any impact on the generality of our discussion, we assume that all variables are scaled to have a mean of zero and a standard deviation of one. A path model illustration of Simpson s paradox Let us assume that we collected data from 300 firms about two variables: degree of collaborative management (X) and firm success (Z). The variable degree of collaborative management (X) measures the degree to which managers and employees collaborate to continuously improve their firms productivity and the quality of their firms products. The variable firm success (Z) measures the profitability of each firm. Figure 1 shows a simple path model relating these two variables. Since this path model contains only two variables, then p = r = 0.5; where p and r denote the path coefficient and the correlation between the two variables. Figure 1: Two-variable path model Figure 2 shows a slightly more complex path model with an additional variable pointing at Z: degree of e-collaboration technology use (Y). This new variable measures the degree to which an e-collaboration technology is used. The technology facilitates collaborative management and is available in all firms studied. Because of this, firms where the degrees of collaborative management (X) are high tend to also use the e-collaboration technology intensely, and thus present high degrees of e-collaboration technology use (Y); hence the link X Y in the model. 2

Figure 2: Three-variable path model In this example, the addition of the new variable led the path coefficient p for the link between the variables degree of collaborative management (X) and firm success (Z) to assume a negative value (-0.2), in contrast with the positive correlation r (0.5) between the same variables. This sign reversal characterizes what is known as Simpson s paradox in path models. The likelihood of Simpson s paradox in contingency tables Simpson s paradox is generally perceived as a problematic phenomenon, since it leads to paradoxical conclusions based on empirical research (Pearl, 2009; Wagner, 1982). In the threevariable path model illustration above, the results would lead many researchers to believe that the association between degree of collaborative management (X) and firm success (Z) is negative. The perception that Simpson s paradox is a problematic phenomenon has motivated Pavlides & Perlman (2009), in a seminal article on Simpson s paradox, to estimate the probability that Simpson s paradox would occur at random. They focused on contingency tables, and made various assumptions, also estimating other conditional probabilities. The probability that Simpson s paradox would occur at random in contingency tables was found to be fairly low, in the neighborhood of 1 to 2 percent. Contingency tables summarize results of multivariate analyses involving categorical variables. Often the categorical variables assume two values each, leading to 2 x 2 tables. Multivariate analyses involving such categorical variables frequently take the form of ANCOVA analyses. Analyses summarized via 2 x 2 tables could be seen as special cases of path analyses where some of the variables are dichotomous. The probability that Simpson s paradox would occur at random obtained by Pavlides & Perlman (2009) could lead empirical researchers to reasonably conclude that Simpson s paradox is an unlikely phenomenon, which should occur only very rarely in empirical research. 3

However, Pavlides & Perlman s (2009) focus on multivariate analyses involving categorical variables may have led to a probability estimate that is significantly lower than that for path analyses, which typically include variables measured on ratio scales. This possibility provided the motivation for the current study. A mathematical definition of Simpson s paradox In this section we provide a mathematical definition of Simpson s paradox, in the form of a basic inequality. This basic inequality is necessary for the development of a more general Simpson s paradox inequality, which in turn can be used in numeric estimations of the paradox s probability. Let X and Z be stochastic variables in a linear path model (Wright, 1934; 1960) where X is hypothesized to directly cause Z. The relationship between these two variables can be expressed as Z = r X + ε, where r is the correlation between X and Z, and ε is the error term that accounts for the variance in Z that is not explained by X. Let Y be another stochastic variable that is correlated with X and that is a true direct cause of Z. The relationship among the variables X, Y and Z can be expressed as Z = p X + p Y + θ, where p and p are the path coefficients for the links X Z and Y Z, respectively, and θ is a new error term that accounts for the variance in Z that is not explained by X and Y. Simpson s paradox in connection with the link X Z occurs when p r < 0. (1) Simpson s paradox in connection with any link among two variables, such as the link Y Z, can be analogously defined (e.g., p r < 0). This simple inequality refers to each link in the path model, and provides the basis for a straightforward test to identify Simpson s paradox occurrences: if the path coefficient and the correlation associated with any link in the path model assume different signs, then an instance of Simpson s paradox exists in connection with the link. The simple inequality above, used to mathematically define Simpson s paradox, forms the foundation on which a more general Simpson s paradox inequality for numeric estimation of probabilities is developed. A general Simpson s paradox inequality In this section we derive a more general inequality, relying only on correlations, which can be used in numeric estimation of unconditional and conditional probabilities that Simpson s paradox would occur. The inequality is stated as a theorem, whose proof follows. 4

Theorem. Let r, r and r be the correlations among the stochastic variables X, Y and Z in a linear path model, where X and Y are hypothesized to directly cause Z. Then Simpson s paradox occurs in connection with the link X Z when r < r r r. (2) Proof. We know from Wright (1934) that the path coefficient p can be expressed in terms of correlations as p =. (3) Multiplying both sides of (3) by r yields p r = r. (4) Comparing (1) and (4), we see that Simpson s paradox occurs when r < 0. (5) Since 1 r is always nonnegative, we can reduce (5) to r < r r r. We can see from this general inequality that Simpson s paradox will not occur when all of the correlations are zero. Also, given that r is always nonnegative, Simpson s paradox will never occur when one of the correlations has the opposite sign of the other two (e.g., r < 0, r > 0 and r > 0). Conversely, if all correlations are positive, then Simpson s paradox will always occur in connection with the link X Z when r < r r. This general inequality is particularly useful in numeric estimations of the likelihood of Simpson s paradox based on Monte Carlo simulations. This is because it contains only correlations, which vary within set bounds. The earlier inequality used to derive this general inequality (i.e., p r < 0), on the other hand, contains a path coefficient (p ), which can vary from to +. How likely is Simpson s paradox in t h r e e -variable path models? Since the correlations in (2) vary from -1 to 1, the general Simpson s paradox inequality can be used directly in Monte Carlo simulations (Delgado, 1993; Robert & Casella, 2005) to estimate unconditional and conditional probabilities of the occurrence of Simpson s paradox. Let us consider the unconditional probability, which is the probability that Simpson s paradox would occur at random. Let C be the set of combinations of random values of correlations {r, r, r } r R 1 < r < 1 i, j {X, Y, Z}, generated through a Monte Carlo simulation. Let S be the subset of combinations within C that satisfy (2). The probability that Simpson s paradox would occur at random is given by 5

P = as C, where S and C are the cardinalities of S and C, respectively. We estimated P by generating a large number of combinations (10,000) of random values of correlations, and repeating this process a number of times (1,000) to obtain an asymptotic probability estimate. Through this process we estimated P to be approximately 12.8 percent. That is, of all possible three-variable path models, slightly more than 1 in 8 will contain an instance Simpson s paradox. This suggests that Simpson s paradox is likely to occur in empirical studies frequently enough to be a source of concern. This may lead to the development of theories, based on empirical findings distorted by Simpson s paradox, which incorporate causal effects that are disconnected from reality. Concluding remarks and future research The probability that Simpson s paradox would occur at random in three-variable path models, which we estimated to be approximately 12.8 percent, may be an underestimation of the actual rate of occurrence of Simpson s paradox in empirical studies employing path analysis. There are various reasons for this supposition. One of them is that empirical studies are designed to test models where correlations among variables are often expected to fall within certain ranges, where conditional Simpson s paradox probabilities may be greater than the probability that Simpson s paradox would occur at random. Also, empirical studies often employ models that are more complex than a three-variable path model. To assess this supposition in the context of actual empirical studies employing path models, we conducted detailed reviews of articles published in the 2003-2013 period in the journal MIS Quarterly, a highly selective academic journal in the applied field of information systems. E- collaboration is frequently seen as a subfield of the field of information systems. We focused on articles that employed path analysis and related methods multiple regression analysis and structural equation modeling where both path coefficients and correlations are normally reported. Among the articles reviewed in this period in MIS Quarterly, 19 contained instances of Simpson s paradox. These amounted to approximately 15.7 percent of all articles published in that journal employing path analysis and related methods. The estimation of conditional probabilities via Monte Carlo simulations, where correlations among variables are expected to fall within certain ranges, is recommended as future research. One such range could be 0.3 < r < 0.8 i, j {X, Y, Z}, which would include correlations that arguably are strong enough to yield significant path coefficients at commonly used sample sizes (e.g., N > 150), but not as strong as to lead to vertical or lateral collinearity (Kock & Lynn, 2012) among variables. Acknowledgments The author thanks Wei Ning and Pelumi Ojomu for their help in the review of published articles and identification of Simpson s paradox occurrences. 6

References Delgado, M.A. (1993). Testing the equality of nonparametric regression curves. Statistics & Probability Letters, 17(3), 199-204. Kock, N., & Lynn, G.S. (2012). Lateral collinearity and misleading results in variance-based SEM: An illustration and recommendations. Journal of the Association for Information Systems, 13(7), 546-580. Pavlides, M.G., & Perlman, M.D. (2009). How likely is Simpson s Paradox? The American Statistician, 63(3), 226-233. Pearl, J. (2009). Causality: Models, reasoning, and inference. Cambridge, England: Cambridge University Press. Robert, C.P., & Casella, G. (2005). Monte Carlo statistical methods. New York, NY: Springer. Wagner, C.H. (1982). Simpson's paradox in real life. The American Statistician, 36(1), 46 48. Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5(3), 161-215. Wright, S. (1960). Path coefficients and path regressions: Alternative or complementary concepts? Biometrics, 16(2), 189-202. 7