Logistic regression: Why we often can do what we think we can do. Maarten Buis 19 th UK Stata Users Group meeting, 10 Sept. 2015

Similar documents
Chapter 1 Introduction. What are longitudinal and panel data? Benefits and drawbacks of longitudinal data Longitudinal data models Historical notes

Interpreting and using heterogeneous choice & generalized ordered logit models

1 Motivation for Instrumental Variable (IV) Regression

EMERGING MARKETS - Lecture 2: Methodology refresher

Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It

Sociology 593 Exam 2 Answer Key March 28, 2002

Unobserved heterogeneity in logistic regression

Econometrics I Lecture 7: Dummy Variables

Comparing Logit & Probit Coefficients Between Nested Models (Old Version)

Estimating the Marginal Odds Ratio in Observational Studies

ISQS 5349 Spring 2013 Final Exam

Sociology 593 Exam 2 March 28, 2002

Midterm 1 ECO Undergraduate Econometrics

ECON 594: Lecture #6

Chapter 2: simple regression model

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems

Marginal effects and extending the Blinder-Oaxaca. decomposition to nonlinear models. Tamás Bartus

Comparing IRT with Other Models

Gov 2000: 9. Regression with Two Independent Variables

Using Instrumental Variables to Find Causal Effects in Public Health

Sociology 593 Exam 1 February 17, 1995

Group comparisons in logit and probit using predicted probabilities 1

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

Comparing groups using predicted probabilities

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

multilevel modeling: concepts, applications and interpretations

Review of Panel Data Model Types Next Steps. Panel GLMs. Department of Political Science and Government Aarhus University.

Lecture Discussion. Confounding, Non-Collapsibility, Precision, and Power Statistics Statistical Methods II. Presented February 27, 2018

The returns to schooling, ability bias, and regression

Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"

Chapter 14. Simultaneous Equations Models Introduction

Logistic regression: Uncovering unobserved heterogeneity

Longitudinal and Panel Data: Analysis and Applications for the Social Sciences. Table of Contents

The Simple Linear Regression Model

Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models

Systematic error, of course, can produce either an upward or downward bias.

Gov 2002: 4. Observational Studies and Confounding

An overview of applied econometrics

Multiple Linear Regression CIVL 7012/8012

Rockefeller College University at Albany

Summary and discussion of The central role of the propensity score in observational studies for causal effects

Notes 6: Multivariate regression ECO 231W - Undergraduate Econometrics

Propensity Score Methods for Causal Inference

Controlling for Time Invariant Heterogeneity

Notes 11: OLS Theorems ECO 231W - Undergraduate Econometrics

We're in interested in Pr{three sixes when throwing a single dice 8 times}. => Y has a binomial distribution, or in official notation, Y ~ BIN(n,p).

Biostatistics Advanced Methods in Biostatistics IV

Regression Discontinuity

WISE International Masters

ECO375 Tutorial 8 Instrumental Variables

Marginal, crude and conditional odds ratios

Non-linear panel data modeling

Partial effects in fixed effects models

36-309/749 Math Review 2014

Mixed Models for Longitudinal Ordinal and Nominal Outcomes

Measuring Social Influence Without Bias

Using predictions to compare groups in regression models for binary outcomes

Limited Dependent Variable Models II

Limited Dependent Variables and Panel Data

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Econometrics. 7) Endogeneity

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Making sense of Econometrics: Basics

Econometrics in a nutshell: Variation and Identification Linear Regression Model in STATA. Research Methods. Carlos Noton.

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

1 A Non-technical Introduction to Regression

8. Instrumental variables regression

In order to carry out a study on employees wages, a company collects information from its 500 employees 1 as follows:

Lecture 3.1 Basic Logistic LDA

1. Regressions and Regression Models. 2. Model Example. EEP/IAS Introductory Applied Econometrics Fall Erin Kelley Section Handout 1

Mid-term exam Practice problems

The Consequences of Unobserved Heterogeneity in a Sequential Logit Model

Simple ways to interpret effects in modeling ordinal categorical data

Introduction to causal identification. Nidhiya Menon IGC Summer School, New Delhi, July 2015

Applied Microeconometrics (L5): Panel Data-Basics

Treatment Effects with Normal Disturbances in sampleselection Package

Ability Bias, Errors in Variables and Sibling Methods. James J. Heckman University of Chicago Econ 312 This draft, May 26, 2006

Applied Health Economics (for B.Sc.)

Econometrics Problem Set 4

Sociology Exam 2 Answer Key March 30, 2012

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS

APPENDIX 1 BASIC STATISTICS. Summarizing Data

Comparing groups in binary regression models using predictions

Chapter 1. Modeling Basics

Econ 1123: Section 2. Review. Binary Regressors. Bivariate. Regression. Omitted Variable Bias

Linear Regression with one Regressor

ECON The Simple Regression Model

Chapter 9 Regression with a Binary Dependent Variable. Multiple Choice. 1) The binary dependent variable model is an example of a

Sociology 593 Exam 1 Answer Key February 17, 1995

Econometrics of causal inference. Throughout, we consider the simplest case of a linear outcome equation, and homogeneous

REVIEW 8/2/2017 陈芳华东师大英语系

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

Selection endogenous dummy ordered probit, and selection endogenous dummy dynamic ordered probit models

Class Introduction and Overview; Review of ANOVA, Regression, and Psychological Measurement

Generalized Linear Models for Non-Normal Data

NELS 88. Latent Response Variable Formulation Versus Probability Curve Formulation

Transcription:

Logistic regression: Why we often can do what we think we can do Maarten Buis 19 th UK Stata Users Group meeting, 10 Sept. 2015

1 Introduction Introduction - In 2010 Carina Mood published an overview article called Logistic regression: Why we cannot do what we think we can do, and what we can do about it - Her main conclusions were: 1. It is problematic to interpret odds ratios as substantive effects,because they also reflect unobserved heterogeneity. 2. It is problematic to compare odds ratios across models with different independent variables,because the unobserved heterogeneity is likely to vary across models. 3. It is problematic to compare odds ratios across groups,because the unobserved heterogeneity can vary across the compared groups. - This article had a big impact in sociology making many researchers unsure whether logistic regression and odds ratios have any use at all. 2 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

1 Introduction History - The article by Mood is a review article, so the points she made are not new: - There was an active debate in the bio-medical sciences (e.g. Gail et al. 1984). - The issue is known in economics (e.g. Lee 1982), but does not play a big role as odds ratios are not popular there anyhow. - Some early work did happen in sociology (e.g. McKelvey and Zavoina 1975), but never reached a large audience. 3 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

2 The problem The first indication - Many studies start with the observation that if one adds a variable to a logistic regression model the remaining coefficients will change even if the added variable is uncorrelated with the other variables. - The easiest explanation of this phenomenon starts with the latent variable representation of logistic regression. - This assumes that there is a latent propensity for experiencing a success. - One experiences the success if the propensity passes a threshold (0). - The propensity of success is a linear function of the explanatory variables plus an error term - So, the probability of success is: y i = x i + " i Pr(y i = 1jx ) = Pr(y > 0) i = Pr(" i > x i ) exp(x i ) = 1 + exp(x i ) if" ( 0; ) p 3 4 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

2 The problem The scale of the latent variable - Notice, that I needed to fix the standard deviation of the error term - The reason is that y is latent, so its scale is unknown - The scale of the latent variable is fixed by fixing the standard deviation of the error term - var(y ) = var(x i ) + var(") - What happens when we add a variable to our model? - That extra variable is removed from the error term, so the variance of the error term decreases (assuming that the variable is uncorrelated with the error term). - But the scale of the dependent variable was defined by fixing the scale of the residual. - So the scale of the dependent variable depends on which variables are in the model. 5 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

2 The problem Mood s three problems problem 1 The scale of the dependent variable depends on which variables are in the model, making it problematic to interpret the resulting coefficients the effect. problem 2 The scale of the dependent variable changes when one adds variables, making comparison of effects problematic. problem 3 Say we want to compare groups and the residual variances is likely to differ across groups (e.g. comparing effects on the probability of some labor market outcome between men and women) problem 3 In that case the scale of the dependent variable differs, making the comparison of effects problematic. 6 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

2 The problem Is the scale of the dependent variable really unidentified? - There is a different way of looking at logistic regression that does not involve a latent dependent variable - In that view logistic regression is a linear model for the log odds of success. - An odds is just an alternative way to quantify how likely a success is: - Instead of considering the expected proportion of successes (probability) you look at the expected number of successes per failure (odds) - So an odds can take any value larger or equal to zero - The log of the odds can take any value - The scale of the log odds is known and does not change when adding or removing variables or comparing groups. - However, this does not solve everything as the coefficients still change when we add or remove uncorrelated variables. 7 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution previous attempts at solving the problem - use a linear probability model and approximations thereof (average marginal effects) as these don t react to uncorrelated additional variables. ( and ) - standardize the dependent variable ( in the package) - estimate the variance of the residual ( ) 8 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution My attempt - current state: I had an intuition for a long time and finally taken time to write a it down in a working-paper. However, a more formal representation would improve that paper. - I start with the question: Is there is a problem that needs solving? - One can think of logistic regression as trying to model the proportion of successes - The proportion is a characteristic of a group. - One can turn it into an individual level characteristic by saying that it is a probability - A probability is an assessment of how likely we think that the event happens - If we add (relevant) information this assessment should change - Adding information in case of logistic regression means adding variables. - So probabilities only exist within the context of a specific model. - For example, adding a variable does not lead to an improved estimate of the probability, but leads to an estimate of a different probability. 9 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution How should effect react to new information? - If we become surer because of the new information our probabilities can become - closer to 0 if we become surer that the event will not happen - closer to 1 if we become surer that the event will happen - So after adding new information there is more room for a variable to have an effect and the effects should increase - The more relevant the new information is, the larger the increase - The odds ratios from logistic regression show exactly this behavior - So the odds ratios can be a meaningful effect size. - However, one needs to specify which variables were in the model, but that makes sense in this interpretation of the dependent variable. - Linear probability models and Average Marginal Effects have been proposed as solutions because they don t change when adding non-confounding variables. - According to my argument, this would actually make them problematic. 10 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution Comparing groups - Say we compare the effect on a labor market outcome between men and women. - The labor market experiences for men tend to be more predictable than that of women. - So we are probably surer in the group of men - So the predicted probabilities can be closer to 0 and 1 for men compared to women - and there is thus more room for a variable to have an effect - The odds ratios from logistic regression show exactly this behavior - So a comparison of odds ratios across groups provides an accurate description of the difference in effect. 11 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution Causality in group comparisons - What would happen to the effect if we turn a man into a women (without the stigma usually associated with such a change)? - The uncertainty the probability measures comes from many variables we did not observe - The sum of the effects of each of these variables is the error term, and the variance of that error term captures the amount of uncertainty - The variance could differ across groups - because the variances of the unobserved variables differ and/or - because their effects differ. - If the difference in residual variance is only due to a difference in effects of the unobserved variables, - then a men draws his unobserved variables from the same distribution as the women, only their effects differ. - So the effects in the female sub-sample represents the counter-factual effects (assuming all other conditions are met). 12 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

3 A Solution Comparing across models - Say we want to explain who enters university and we have parental status and previous school performance - part of the effect of parental status is due to the fact that - high status children tend to do well in school and - those that do well are more likely to enter university - An attempt to quantify this indirect of parental status through school performance is often done by comparing - the effect of parental status in a model with only parental status with - the effect of parental status in a model with parental status and previous school performance - This won t work in logistic regression, as the effect of parental status changes not just because of the correlation between parental status and school performance but also because of the extra certainty obtained from adding school performance. - We can use or for these problems. 13 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

4 Conclusions Conclusions 1. The odds ratio is a meaningful effect-size. The fact that it is dependent on which variables are included in the model is not a problem but actually a requirement for an effect on a probability. 2. It is indeed problematic to compare coefficients across models with different sets of explanatory variables, since effects on probabilities are supposed to change when variables are added to the model even if they are uncorrelated with the other explanatory variables. 3. A comparison of odds ratios across groups provides an accurate description of the difference in effects across these groups, and under special circumstances can also be given a causal interpretation. - Points where I am still uncertain are: - How does this argument work when the added information from a new variable is inconsistent with what we knew before? - In my argument the linear probability model is not appropriate, but it is usually more helpful to say what it does measure. - My hope is that a more formal representation of this argument will help solve those questions. 14 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

Thank you! - This is work in progress, so I welcome all questions and comments - There is a working paper available from 15 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis

References M. H. Gail, S. Wieand, and S. Piantadosi Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates Biometrika, 71(3):431 444, 1984. L.-F. Lee Specification error in multinomial logit models: Analysis of the omitted variable bias Journal of Econometrics, 20(2):197 209, 1982. R. D. McKelvey and W. Zavoina A Statistical Model for the Analysis of Ordinal Level Dependent Variables Journal of Mathematical Sociology, 4(1):103 120, 1975. C. Mood Logistic regression: Why we cannot do what we think we can do, and what we can do about it European Sociological Review, 26(1):67 82, 2010. 16 / 16 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis