Predicting the Treatment Status

Nikolay Doudchenko

1 Introduction

Many studies in the social sciences deal with treatment effect models.[1] Usually there is a treatment variable which indicates whether a particular individual (or, more generally, a unit of observation) was affected by some intervention (the binary case) or records the intensity of that intervention. A researcher is interested in evaluating the effect of the treatment, measured as some function of the joint distribution of $y^{(i)}_1$, the potential outcome for individual $i$ conditional on being treated, and $y^{(i)}_0$, the potential outcome conditional on not being treated. Many papers are devoted to this setup because of a fundamental problem: for each individual $i$ either $y^{(i)}_1$ or $y^{(i)}_0$ is observed in the data, but never both.

In my project I consider a somewhat different problem. What if we don't know who received the treatment and who did not? For example, suppose that we organize a lottery in which every participant draws a ticket from a big jar. Some of the tickets are winning tickets (the individual is treated) while others are not (the individual is not treated). We cannot observe whether a particular person has won, but we can observe some outcome potentially affected by the realization of the lottery. Can we infer anything about the treatment?[2] To be precise, I ask two questions: (i) Can we guess the treatment status? (ii) Is it possible to measure the treatment effect even though we don't observe the treatment dummy?

I approach this as an unsupervised learning problem. Although the treatment status is observed in the data that I use, I assume that in practice it won't be, so there is no way to train the model. The main tool in my analysis is what I call the Hidden Variable Model (HVM). Essentially, this is a hidden Markov model with the main difference that there is no Markov structure: instead of observing a number of consecutive realizations for one individual, we observe a number of independent individuals. The hidden state is exactly the treatment status, which we do not observe. We may also know some additional details of the assignment process or of how the treatment affects the outcome. For example, we may know that the treatment status was determined based on a subset of the features, or that only some of the features affect the distribution of the outcome.

In this project I consider both simulated data and a real, publicly available dataset that comes from a study by Angrist, Bettinger, Bloom, and King (2002). The results suggest that because both the treatment and the outcome are stochastic, the treatment status is hard to retrieve. Not surprisingly, the stronger the treatment effect, the easier it becomes to guess the treatment variable. Intuitively, if the treatment effect is positive and large, a large value of the outcome suggests that the individual was treated. The estimates of the treatment effect themselves, however, are more accurate. Since obtaining these estimates is usually the main goal of the researcher, there is some hope that this project wasn't completely useless.

The rest of the text is organized as follows. Section 2 describes the statistical model, referred to as the Hidden Variable Model. Section 3 reports the results obtained using simulated data. Section 4 discusses the real dataset that I use in Section 5. Section 6 concludes.

[1] Another prominent example would be medical studies. It does not matter for the theoretical derivations, nor for the simulations. The real-world data that I use comes from economics, so I stick to the social sciences context and terminology.

[2] Another interesting formulation is the following. What if there is an additional step required to complete the treatment? We know the individuals that were supposed to be treated, but we don't know who actually undertook that step. What can we do in this case?

2 Model

Suppose that for each observation $i = 1, \ldots, m$ there are $n$ features (or covariates) $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_n)^T$ which determine the probability, $p(d = 1 \mid x^{(i)})$, of being assigned to the treatment group. The treatment status, $d^{(i)}$, and the covariates jointly determine the distribution of the outcome variable, $p(y \mid d^{(i)}, x^{(i)})$. The graphical representation of the model is given in Figure 1.

[Figure 1: Graphical representation of the model. Nodes: Covariates (x), Treatment Status (d), Output (y).]

The covariates, the treatment status, and the outcome are all stochastic. I assume that only $x^{(i)}$ and $y^{(i)}$ are observed and $d^{(i)}$ is hidden. The likelihood of $(y^{(i)}, x^{(i)})$ is given by

$$p(y^{(i)}, x^{(i)}) = \sum_{d} p(y^{(i)}, d \mid x^{(i)}) \, p(x^{(i)}).$$

The log-likelihood is

$$
\begin{aligned}
\log p(y^{(i)}, x^{(i)}) &= \log \sum_{d} Q(d) \frac{p(y^{(i)}, d \mid x^{(i)})}{Q(d)} + \log p(x^{(i)}) \\
&\geq \sum_{d} Q(d) \log \frac{p(y^{(i)}, d \mid x^{(i)})}{Q(d)} + \log p(x^{(i)}) \\
&= \sum_{d} Q(d) \log \frac{p(y^{(i)} \mid d, x^{(i)}) \, p(d \mid x^{(i)})}{Q(d)} + \log p(x^{(i)}),
\end{aligned}
$$

where $Q$ is a p.m.f. of some distribution over $d \in \{0, 1\}$. The inequality above follows from Jensen's inequality. If we assume that $p(d \mid x^{(i)})$ and $p(y \mid d, x^{(i)})$ have particular parametric forms, the parameters can be estimated using the expectation-maximization (EM) algorithm.

There is one fundamental issue, though. We can try to split the dataset into two parts, but it is impossible to say which of the two parts is the treatment group and which is the control group. In practice there is usually some additional knowledge that allows us to make an assumption about the sign of the treatment effect. Here I assume that the treatment effect is positive, which lets us predict that the group with the higher average $y^{(i)}$ is the treatment group.

2.1 Expectation-maximization

I assume that $p(d \mid x^{(i)}) = p_d$ does not depend on $x^{(i)}$ and is known. In other words, the assignment to the treatment and control groups is completely random (it does not depend on the covariates), and the researcher knows the fraction of people assigned to the treatment group.[3] I also assume that $y$ is binary (the case that corresponds to the real data that I use) and that

$$p(y \mid d, x^{(i)}) = \left( g(\theta_0 d + \theta^T x^{(i)}) \right)^{y} \left( 1 - g(\theta_0 d + \theta^T x^{(i)}) \right)^{1 - y},$$

where $g(s) = 1/(1 + \exp(-s))$ is the sigmoid function and $\theta_0$, $\theta = (\theta_1, \ldots, \theta_n)^T$ are the parameters that we need to estimate. During the E-step of the algorithm I compute $p(y^{(i)}, d \mid x^{(i)})$ (which, after normalization over $d$, gives the distribution $Q$). During the M-step I use Newton's method to maximize the lower bound on the log-likelihood.[4]

[3] It is easy to modify the M-step for the case when $p_d$ is unknown.

[4] I tried to use gradient ascent too, but it leads to less accurate estimates of the average treatment effect. Moreover, the estimates become very sensitive to the initial values of $\theta_0$, $\theta$.
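The paper does not include an implementation, so the following is a minimal numpy sketch of one way the E- and M-steps above could be coded, assuming the logistic outcome model, a known treatment probability p_treat = $p_1$, and a feature matrix X whose rows are the $x^{(i)}$; the function names, iteration counts, and random initialization are my own choices rather than the paper's.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def e_step(y, X, beta, p_treat):
    """Responsibilities q_i(d) proportional to p(d) p(y_i | d, x_i), d in {0, 1}."""
    theta0, theta = beta[0], beta[1:]
    q = np.empty((len(y), 2))
    for d, p_d in ((0, 1.0 - p_treat), (1, p_treat)):
        mu = sigmoid(theta0 * d + X @ theta)
        q[:, d] = p_d * mu ** y * (1.0 - mu) ** (1 - y)
    return q / q.sum(axis=1, keepdims=True)

def m_step(y, X, q, beta, newton_iters=25):
    """Maximize the EM lower bound over (theta0, theta) with Newton's method.
    Each observation enters twice (d = 0 and d = 1), weighted by q_i(d)."""
    Z = np.vstack([np.column_stack([np.full(len(y), d), X]) for d in (0, 1)])
    yy = np.concatenate([y, y])
    w = np.concatenate([q[:, 0], q[:, 1]])
    for _ in range(newton_iters):
        mu = sigmoid(Z @ beta)
        grad = Z.T @ (w * (yy - mu))                         # gradient of the weighted log-likelihood
        hess = -(Z * (w * mu * (1.0 - mu))[:, None]).T @ Z   # its Hessian
        beta = beta - np.linalg.solve(hess, grad)
    return beta

def fit_hvm(y, X, p_treat=0.5, em_iters=50, seed=0):
    """Alternate E- and M-steps from a small random start; returns (theta0, theta) and q."""
    rng = np.random.default_rng(seed)
    beta = 0.1 * rng.standard_normal(X.shape[1] + 1)   # beta = (theta0, theta_1, ..., theta_n)
    for _ in range(em_iters):
        q = e_step(y, X, beta, p_treat)
        beta = m_step(y, X, q, beta)
    return beta, e_step(y, X, beta, p_treat)
```

The M-step here is a weighted logistic regression: each observation appears once with $d = 0$ and once with $d = 1$, weighted by its responsibility $q_i(d)$. Since footnote 4 (and Section 5) note that the estimates are sensitive to the starting values, a simple precaution, again mine rather than the paper's, is to restart EM several times and keep the fit with the highest observed-data log-likelihood:

```python
import numpy as np

def log_likelihood(y, X, beta, p_treat):
    """Observed-data log-likelihood: sum_i log sum_d p(d) p(y_i | d, x_i)."""
    theta0, theta = beta[0], beta[1:]
    total = np.zeros(len(y))
    for d, p_d in ((0, 1.0 - p_treat), (1, p_treat)):
        mu = 1.0 / (1.0 + np.exp(-(theta0 * d + X @ theta)))
        total += p_d * mu ** y * (1.0 - mu) ** (1 - y)
    return float(np.log(total).sum())

# Restart fit_hvm (defined above) a few times and keep the best fit:
# fits = [fit_hvm(y, X, p_treat=0.5, seed=s) for s in range(10)]
# beta, q = max(fits, key=lambda f: log_likelihood(y, X, f[0], 0.5))
```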

3 Simulations

The goal of this section is to assess the performance of the model when we know exactly how the data are generated. Throughout this section $n = 10$, $x^{(i)}_1 = 1$ for all $i = 1, \ldots, m$, and $x^{(i)}_j \sim N(0, 1)$ independently across $i = 1, \ldots, m$ and $j = 2, \ldots, n$. I set $p_0 = p_1 = 1/2$, $\theta_1 = 0$, $\theta_j = 1$ for $j = 2, \ldots, 5$ and $\theta_j = -1$ for $j = 6, \ldots, 10$. I assume that $d^{(i)}$ and $y^{(i)}$ are generated according to the model described in Section 2.
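A minimal sketch of this data-generating process, assuming numpy; the function name, the seed handling, and returning the (in practice unobserved) $d^{(i)}$ for later scoring are my own conventions.

```python
import numpy as np

def simulate(m, theta0, seed=0):
    """Draw (X, d, y) from the Section 2 model with the parameters used in this section."""
    rng = np.random.default_rng(seed)
    n = 10
    X = np.column_stack([np.ones(m), rng.standard_normal((m, n - 1))])   # x_1 = 1
    theta = np.array([0.0] + [1.0] * 4 + [-1.0] * 5)                     # theta_1, ..., theta_10
    d = rng.binomial(1, 0.5, size=m)                                     # p_0 = p_1 = 1/2
    p_y = 1.0 / (1.0 + np.exp(-(theta0 * d + X @ theta)))
    y = rng.binomial(1, p_y).astype(float)
    return X, d, y
```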

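Performance below is judged by the misclassification error against the observed $d^{(i)}$ and by the estimated average treatment effect. The paper does not spell out exactly how the ATE estimate is formed from the fitted model; the sketch below uses one natural convention under the logistic outcome model, the sample average of $g(\theta_0 + \theta^T x^{(i)}) - g(\theta^T x^{(i)})$, plus a difference in group means for clustering-style predictions. The helper names and these conventions are mine.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def align_labels(y, d_hat):
    """Resolve the label ambiguity: under a positive treatment effect, call the
    predicted group with the higher average outcome the treatment group."""
    if y[d_hat == 1].mean() < y[d_hat == 0].mean():
        d_hat = 1 - d_hat
    return d_hat

def misclassification_error(d_true, d_hat):
    """Share of observations whose treatment status is guessed incorrectly."""
    return float(np.mean(d_hat != d_true))

def ate_model_based(X, beta):
    """ATE implied by the fitted logistic outcome model (my convention):
    the sample average of p(y = 1 | d = 1, x) - p(y = 1 | d = 0, x)."""
    theta0, theta = beta[0], beta[1:]
    return float(np.mean(sigmoid(theta0 + X @ theta) - sigmoid(X @ theta)))

def ate_group_means(y, d_hat):
    """Difference in mean outcomes between the predicted groups (used for k-means)."""
    return float(y[d_hat == 1].mean() - y[d_hat == 0].mean())
```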
The next two plots show how the misclassification error[5] and the estimate of the average treatment effect[6] behave as $\theta_0$ changes.[7]

[Figure 2: Misclassification error (left panel) and average treatment effect (right panel) as functions of $\theta_0 \in [0, 10]$. The left panel plots Error (actual) and Error (oracle); the right panel plots ATE (estimated), ATE (oracle), and ATE (true).]

These plots show that the algorithm fails to accurately predict the treatment status. Although the task becomes easier as the treatment effect increases, even when $\theta_0 = 10$ about 25% of the observations are misclassified on average. However, because both the treatment status and the outcome are stochastic, even the oracle model predicts poorly and produces essentially the same misclassification errors as the main model. The estimates of the average treatment effect look much more promising. There are considerable discrepancies when $\theta_0$ is small, but as it increases the estimate of the average treatment effect converges to that of the oracle model.

To save space I do not report the behavior of the misclassification error and the estimate of the average treatment effect as a function of the sample size.[8] Sample sizes above 1,000 seem to have essentially no effect on the performance.

[5] This is an unsupervised learning problem, so the term "misclassification error" might be confusing. However, the $d^{(i)}$ are actually observed, so it is possible to calculate how many times the model has correctly guessed $d^{(i)}$.

[6] Since in the statistical model that I consider the treatment status is independent of the potential outcomes (the randomization assumption), the true average treatment effect is $E[y^{(i)}_1 - y^{(i)}_0] = E[y^{(i)} \mid d^{(i)} = 1] - E[y^{(i)} \mid d^{(i)} = 0]$ and can be easily estimated.

[7] In the plots the word "oracle" refers to the case when we know the true parameters of the model but do not observe the $d^{(i)}$'s. The word "true" refers to the case when the $d^{(i)}$'s are observed.

[8] I presented these results at the poster session.

4 Data

The dataset that I use comes from Angrist, Bettinger, Bloom, and King (2002).[9] I use data from a randomized trial conducted in Colombia in the 1990s. The idea of the experiment was to randomly assign vouchers to low-income families with children attending public primary schools. These vouchers could be used to pay tuition at private secondary schools. The data consist of 1,315 individual-level observations. The features include the student's gender and age, the city where he or she lives, the year of the lottery (1993, 1995, or 1997), a dummy for whether the student's household has a phone, and the age and years of schooling of the student's parents. There is also a treatment status variable which I want to predict, but I assume that it is not observed by the researcher. The outcome variable which the treatment status is supposed to affect is a dummy for whether the student had finished the 8th grade by the time of the follow-up survey.

[9] This section is based heavily on the description of the experiment in the paper.
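Before turning to the results, here is a sketch of how the comparison in the next section might be set up. It assumes a numeric feature matrix X built from the variables just described (how the categorical features are encoded is left open), the outcome y and the observed status d (the latter used only for scoring), scikit-learn's KMeans for the baseline, and the fit_hvm and evaluation helpers from the earlier sketches; none of this is code from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def compare_models(y, X, d_true, p_treat):
    """Compare k-means (k = 2) with the HVM on the same feature matrix X.
    d_true is used only for scoring; p_treat is the known assignment fraction.
    Relies on fit_hvm, align_labels, misclassification_error, ate_group_means,
    and ate_model_based from the earlier sketches."""
    d_km = align_labels(y, KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
    beta, q = fit_hvm(y, X, p_treat=p_treat)
    d_hvm = align_labels(y, (q[:, 1] > 0.5).astype(int))
    return {
        "k-means": (misclassification_error(d_true, d_km), ate_group_means(y, d_km)),
        "Hidden variable": (misclassification_error(d_true, d_hvm), ate_model_based(X, beta)),
    }
```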

5 Results

The table below reports the performance of the model on the real data. I compare the HVM with a k-means model on the full set of features with $k = 2$. The true average treatment effect (computed using the actual values of the $d^{(i)}$'s) is 0.068. The numbers below therefore show that both the HVM and k-means perform poorly. As noted in the previous section, when the treatment effect is low the HVM tends to produce inaccurate estimates (which depend substantially on the initial values).

Model              Misclassification Error    Average Treatment Effect
k-means            0.499                      0.353
Hidden variable    0.465                      0.002

6 Conclusion

The results suggest that the Hidden Variable Model allows us to estimate the average treatment effect and tends to perform somewhat better than a simple k-means model. However, it fails to accurately predict the treatment status, and the estimate of the average treatment effect depends on the initial values. To improve the performance of the model I would like to add more (derived) features and consider alternative specifications of $p(y \mid d, x^{(i)})$.

References

Angrist, J., E. Bettinger, E. Bloom, and E. King (2002). Vouchers for private schooling in Colombia: Evidence from a randomized natural experiment. The American Economic Review.

Kuroki, M. and J. Pearl (2014). Measurement bias and effect restoration in causal inference. Biometrika 101(2), 423–437.