Deep Learning for Causal Inference


Vikas Ramachandra
Stanford University Graduate School of Business
655 Knight Way, Stanford, CA

Abstract

In this paper, we propose the use of deep learning techniques in econometrics, specifically for causal inference and for estimating individual as well as average treatment effects. The contribution of this paper is twofold:
1. For generalized neighbor matching to estimate individual and average treatment effects, we analyze the use of autoencoders for dimensionality reduction while maintaining the local neighborhood structure among the data points in the embedding space. This deep learning based technique is shown to perform better than simple k nearest neighbor matching for estimating treatment effects, especially when the data points have several features/covariates but reside on a low-dimensional manifold in a high-dimensional space. We also observe better performance than manifold learning methods for neighbor matching.
2. Propensity score matching is one specific and popular way to perform matching in order to estimate average and individual treatment effects. We propose the use of deep neural networks (DNNs) for propensity score matching, and present a network called PropensityNet for this. This is a generalization of the logistic regression technique traditionally used to estimate propensity scores, and we show empirically that DNNs perform better than logistic regression at propensity score matching.
Code for both methods will be made available shortly on Github.

1. The problem of causal inference

We consider a setup where there are n units or data points, indexed by i = 1, ..., n. We postulate the existence of a pair of potential outcomes for each unit, (Y_i(0), Y_i(1)) (following the potential outcome or Rubin Causal Model [4]), with the unit-level causal effect defined as the difference in potential outcomes, T_i = Y_i(1) - Y_i(0). Let W_i ∈ {0, 1} be the binary indicator for the treatment, with W_i = 0 indicating that unit i received the control treatment, and W_i = 1 indicating that unit i received the active treatment. The realized outcome for unit i is the potential outcome corresponding to the treatment received:

Y_i(obs) = Y_i(W_i) = Y_i(0) if W_i = 0, and Y_i(1) if W_i = 1.

Let X_i be an N-component vector of features, covariates or pretreatment variables, known not to be affected by the treatment. Our data consist of the triples (Y_i(obs), W_i, X_i), for i = 1, ..., n, which are regarded as an i.i.d. sample drawn from a large population. We assume that observations are exchangeable, and that there is no interference (the stable unit treatment value assumption, or SUTVA).
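
To make the potential-outcome notation concrete, here is a minimal simulation sketch (hypothetical data and coefficients, not from the paper) showing that each unit carries two potential outcomes but only the one selected by W_i is realized:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data-generating process: covariates, potential outcomes, treatment.
X = rng.normal(size=(n, 3))                                   # pre-treatment covariates X_i
Y0 = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(0, 0.1, n)   # Y_i(0)
Y1 = Y0 + 2.0                                                 # Y_i(1); true effect T_i = 2
W = rng.integers(0, 2, size=n)                                # treatment indicator W_i

# Realized outcome: Y_i(obs) = Y_i(1) if W_i = 1, else Y_i(0).
Y_obs = np.where(W == 1, Y1, Y0)

# The counterfactual is never observed; in a simulation we can still check the truth.
print("true average treatment effect:", (Y1 - Y0).mean())
```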

Since we cannot observe the counterfactual for any particular unit, one way to estimate the treatment effect for each unit is to use values from its neighbors which received the opposite treatment, and to take the difference between the two outcomes. This individual treatment effect (ITE) can be written as:

ITE_i = T_i(estimated) = Y_i(1) - Y_neighbor(0), if W_i = 1, and -(Y_i(0) - Y_neighbor(1)), if W_i = 0.

There are different techniques to determine the neighbors in the above construct. We will look at two such methods, 1. generalized neighbor matching and 2. propensity score based matching, and introduce deep learning based models to do both types of matching.

2. Neighbor matching to estimate individual and average treatment effects

As discussed above, the missing counterfactual data problem can be addressed (under certain assumptions [2]) by matching each unit which did not receive treatment (W = 0) with its nearest unit from the group that received treatment (W = 1), for the binary treatment case. Various techniques have been used for matching: propensity score based matching [3] as well as generalized neighbor matching [1] (using clustering, spectral clustering and manifold learning methods).

2.1 Propensity score matching

One of the most popular techniques for matching uses propensity scores [2][3], as briefly described below. In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that would affect an estimate of the treatment effect obtained by simply comparing outcomes among units that received the treatment versus those that did not. The technique implements the Rubin causal model for observational studies. The possibility of bias arises because the apparent difference in outcome between these two groups of units may depend on characteristics that affected whether or not a unit received a given treatment, rather than on the effect of the treatment per se. In randomized experiments, the randomization enables unbiased estimation of treatment effects; for each covariate, randomization implies that the treatment groups will be balanced on average, by the law of large numbers. Unfortunately, for observational studies, the assignment of treatments to research subjects is typically not random. Matching attempts to mimic randomization by creating a sample of units that received the treatment that is comparable on all observed covariates to a sample of units that did not receive the treatment; these two matched groups can then be used to estimate the average or individual treatment effect (by taking the difference between the outcomes of the two matched groups or units).

PSM is intended for causal inference and simple selection bias in non-experimental settings in which: (i) few units in the non-treatment comparison group are comparable to the treatment units; and (ii) selecting a subset of comparison units similar to the treatment units is difficult because units must be compared across a high-dimensional set of pretreatment characteristics. PSM employs a predicted probability of group membership (e.g., treatment vs. control group), based on observed predictors and usually obtained from logistic regression, to create a counterfactual group.

The traditional procedure for propensity score matching is as follows (a sketch of steps 1 and 3 is given after this list):
1. Run a logistic regression with the treatment indicator as the dependent variable (W = 1 if the unit participates, W = 0 otherwise). Choose appropriate confounders (variables hypothesized to be associated with both treatment and outcome). Obtain the propensity score: the predicted probability p, or log[p/(1 - p)].
2. Check that the propensity score is balanced across treatment and comparison groups, and check that covariates are balanced across treatment and comparison groups within strata of the propensity score.
3. Match each participant to one or more nonparticipants on the propensity score; traditionally, nearest neighbor matching is used.
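
As a rough illustration of steps 1 and 3 above (a sketch under assumed simulated data, not the paper's implementation), the snippet below fits a logistic regression for the propensity score and matches each treated unit to the control unit with the nearest score; the balance checks of step 2 are omitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Assumed data-generating process: confounders X drive both treatment and outcome.
X = rng.normal(size=(n, 2))
W = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1]))))
Y = 1.5 * W + X @ np.array([1.0, 1.0]) + rng.normal(size=n)

# Step 1: logistic regression of treatment on confounders gives the propensity score.
ps = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]

# Step 3: match each treated unit to the control unit with the closest propensity score.
treated = np.where(W == 1)[0]
control = np.where(W == 0)[0]
matches = control[np.abs(ps[treated][:, None] - ps[control][None, :]).argmin(axis=1)]

ite_treated = Y[treated] - Y[matches]     # estimated unit-level effects for treated units
print("estimated effect on the treated:", ite_treated.mean())
```
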
2.2 Generalized neighbor matching

It has been shown that with increasing dimensions, propensity score based nearest neighbor matching has increasing bias [2]. To overcome this problem, various alternatives to propensity score matching have been proposed in the literature, such as using random projections [1] and spectral clustering and local linear embeddings [6]. These techniques work well when the data points span a lower-dimensional manifold in a higher-dimensional space.

Our contributions:

1. In this paper, we use deep learning based autoencoders for generalized neighbor matching, to estimate the treatment effect for each data point. We compare the error in the estimated treatment effect using our method against k nearest neighbors as well as manifold learning techniques, on simulated datasets, and verify that autoencoder based dimensionality reduction and neighbor matching gives lower error and a better low-dimensional representation than k nearest neighbors and manifold learning methods.
2. In the case of the propensity score based method for matching, we also propose the use of deep neural networks (DNNs) for step 1 above, in lieu of traditional logistic regression, for propensity score estimation, and we present results on simulated datasets to verify the superior performance of the proposed DNN, PropensityNet, for this task.

3. Autoencoders for generalized neighbor matching

An autoencoder is an artificial neural network used for unsupervised learning of efficient codings of the input data [7]. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.

3.1 Deep learning based clustering: Autoencoders

Architecturally, the simplest form of an autoencoder is a feedforward, non-recurrent neural network very similar to the multilayer perceptron (MLP), having an input layer, an output layer and one or more hidden layers connecting them, but with the output layer having the same number of nodes as the input layer, and with the purpose of reconstructing its own inputs (instead of predicting a target value). Therefore, autoencoders are unsupervised learning models. An autoencoder always consists of two parts, the encoder and the decoder, which can be defined as transitions (φ, ψ) such that:

φ : X → F (encoder)
ψ : F → X (decoder)
(φ, ψ) = argmin over (φ, ψ) of ‖X - (ψ ∘ φ)(X)‖, in the L2-norm sense.

The nonlinear functional mappings for the encoder and decoder are learnt so as to minimize the reconstruction error above. If the learnt mapping takes the input to a lower-dimensional encoding, it becomes a form of non-linear dimensionality reduction. The training algorithm for an autoencoder can be summarized as follows. For each input x:
- Do a feed-forward pass to compute activations at all hidden layers, then at the output layer to obtain an output x'.
- Measure the deviation of x' from the input x (typically using squared error).
- Backpropagate the error through the net and perform weight updates.
Repeat the above steps for several epochs until the error falls below a certain threshold or converges.

3.2 Autoencoders for neighbor matching

We build an autoencoder with the following structure. If the input data has N dimensions, the first and last layers of the autoencoder have N neurons. Our aim is to reduce the dimensionality to M, so the middle layer of the autoencoder has M neurons, as shown in the figure below (left). In the case of our simulated dataset, we have N = 3, M = 2, and 1500 data points. With the hidden dimension M = 2, the encoder maps each 3-dimensional input to a 2-dimensional code and the decoder maps it back, so the 1500 points are represented by a 1500 x 2 matrix in the embedding space. The training process learns the weights iteratively, using backpropagation of the gradient of the mean squared error loss.

Figure: Left: the autoencoder network. Right: the training mean squared error at each epoch.
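
A minimal Keras sketch of such an autoencoder is shown below. The 3-neuron input/output layers and the 2-neuron bottleneck follow the N = 3, M = 2 setup in the text; the activations, optimizer and number of epochs are assumptions rather than the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N, M = 3, 2   # input dimension and bottleneck (code) dimension, as in the text

# Encoder phi: N -> M, decoder psi: M -> N; the model reconstructs its own input.
inputs = keras.Input(shape=(N,))
code = layers.Dense(M, activation="tanh", name="encoder")(inputs)
outputs = layers.Dense(N, name="decoder")(code)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # mean squared reconstruction error

# Train on 1500 hypothetical 3-D points (stand-ins for the simulated dataset).
X = np.random.default_rng(0).normal(size=(1500, N)).astype("float32")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# The trained encoder provides the 2-D embedding later used for neighbor matching.
encoder = keras.Model(inputs, code)
X_embedded = encoder.predict(X, verbose=0)   # shape (1500, 2)
```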

In the M-dimensional space, the individual treatment effect (ITE) is calculated as the difference between the outcome of the present unit (if treated) and that of its untreated neighbor(s) in that space, using Euclidean distance in the M-dimensional space to identify neighbors: ITE = (Y_unit_treated - Y_neighbor_M_dim_untreated). The expression for the ITE is the same for manifold learning techniques; the main difference is how we obtain the mapping to the reduced M-dimensional space, via manifold learning versus autoencoder techniques.

3.3 Experiments and results for generalized neighbor matching

We simulate a dataset as follows. 1500 points are generated in 3 dimensions, and a Swiss roll function is applied so that the points lie along a 2D manifold in 3D space, as shown in the figure below. The generating function f(x) for the Swiss roll is:

n = 1500; t = (3π/2) · (1 + 2 · rand(n)); h = 11 · rand(n); f(x) = [t · cos(t), h, t · sin(t)] + noise

The data is also split into 6 groups based on the distances from neighbors along the manifold, as shown in the figure below. For each data point, we assign a binary treatment variable W = 0 or 1, and also outcome Y values as a simple linear combination of the x covariates, using 2 different functions depending on whether W = 0 or W = 1. Then, we project the dataset onto lower dimensions (M = 2) using A. autoencoders and B. manifold learning.
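
The Swiss roll simulation described above can be sketched as follows; the generating function mirrors the formula in the text, while the noise level and the two linear outcome functions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1500

# Swiss roll: f(x) = [t*cos(t), h, t*sin(t)] + noise, with t and h as defined in the text.
t = 3 * np.pi / 2 * (1 + 2 * rng.random(n))
h = 11 * rng.random(n)
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)]) + 0.05 * rng.normal(size=(n, 3))

# Binary treatment, and outcomes as two different (assumed) linear functions of the covariates.
W = rng.integers(0, 2, size=n)
Y0 = X @ np.array([0.3, 0.2, 0.1])              # outcome function under W = 0
Y1 = X @ np.array([0.3, 0.2, 0.1]) + 1.0        # outcome function under W = 1
Y_obs = np.where(W == 1, Y1, Y0)
```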

Figure: Original Swiss roll dataset in 3 dimensions used for the simulations. Colors show the 6 classes the data was split into for the simulation.

Figure: K-means clustering in the original space for the Swiss roll: it can easily be seen that the algorithm does not learn the manifold nature of the data, and puts far-off manifold points from different classes into the same group (color), e.g. the sky blue points. Similarly, k nearest neighbors performs poorly because it does not learn the structure of the data manifold.
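
For reference, a short sketch of the k-means baseline in the original 3-D space (using scikit-learn's built-in Swiss roll generator, which has the same form as the f(x) above); Euclidean clusters here tend to cut across the roll rather than follow it, which is the failure mode described in the caption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

# Regenerate a Swiss roll and cluster it into 6 groups in the raw 3-D space.
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Points that are close in Euclidean distance but far apart along the roll can land in
# the same cluster, which is also why plain k-NN matching in this space performs poorly.
```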

When we project the data to 2D space, we visualize the projections. The figure below shows the dimensionality reduction using A. principal component analysis (PCA), B. manifold learning (center) based on matrix factorization, and C. the autoencoder. It is clear that both B and C do a good job of learning the structure of the data, unlike PCA; thus k nearest neighbors in the PCA-reduced space using Euclidean distance performs poorly, as it did in the original dimensions. Next, to compare B (manifold learning) and C (autoencoders), we also compute the estimated treatment effect for each point (ITE) and the average absolute ITE error for each method over all the data points in the test set. The autoencoder's mean absolute ITE error is 20.27% lower than the manifold learning estimate.

Figure: Clustering and dimensionality reduction using various methods (the same color implies the same originally assigned group in the simulated data). Left: PCA. Center: manifold learning. Right: autoencoders. The output from the autoencoder also gives the least error in the estimated treatment effect across all units.
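
A hedged sketch of the matching step in the reduced space: given any M-dimensional embedding Z (from PCA, manifold learning, or the autoencoder), together with the treatment indicator W and observed outcomes Y, each unit is matched to its nearest Euclidean neighbor in the opposite treatment group and the outcomes are differenced, following the ITE expression above. The function name and structure are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def matched_ite(Z, W, Y):
    """Estimate unit-level treatment effects by nearest-neighbor matching in the
    low-dimensional embedding Z (shape n x M), using Euclidean distance."""
    Z, W, Y = np.asarray(Z), np.asarray(W), np.asarray(Y)
    ite = np.empty(len(Y))
    for group, other in [(1, 0), (0, 1)]:
        idx = np.where(W == group)[0]          # units in this treatment arm
        pool = np.where(W == other)[0]         # candidate neighbors in the other arm
        nn = NearestNeighbors(n_neighbors=1).fit(Z[pool])
        _, j = nn.kneighbors(Z[idx])
        match = pool[j[:, 0]]
        # Treated: Y_i(1) - Y_neighbor(0); control: Y_neighbor(1) - Y_i(0).
        ite[idx] = Y[idx] - Y[match] if group == 1 else Y[match] - Y[idx]
    return ite

# Mean absolute error against the known simulated effect can then be compared across
# the PCA, manifold learning and autoencoder embeddings, as in the experiments above.
```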

4. Deep neural networks (DNNs) for propensity score matching

In the section above, we showed how autoencoders can be used for generalized neighbor matching. In this section, we show how deep neural networks for classification can be leveraged to do propensity score matching, specifically to replace the logistic regression described in section 2.1.

4.1 Deep neural networks for classification

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. DNNs can model complex non-linear relationships and can be used for both classification and regression tasks [8]. DNN architectures generate compositional models in which the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network or model. DNNs are feedforward networks in which data flows from the input layer to the output layer without looping back. For classification, the last layer of the network is a softmax layer, which outputs the probability of each class. The intermediate layers can be of any form, and the output of each layer is typically passed through a non-linear function. We can learn the parameters of the classification DNN using a labeled training dataset, in which each data point or unit has a ground-truth label. A cost function is specified (such as misclassification error), and the error is back-propagated through the network to update the weights along the gradient directions iteratively, until we achieve a low error. The learning typically happens in steps over batches of the data (stochastic gradient descent). The figure below shows an example of a general DNN.

Figure: A general fully connected DNN for classification.

4.2 PropensityNet: Experiments and results for DNN based propensity score matching

We build a DNN, PropensityNet, to estimate the propensity score, with the inputs being the covariates X as well as the outcome Y across all units. The data is split into training and cross-validation folds, and categorical cross-entropy is used as the error metric (it gives a measure of label misclassification). We use Adadelta as the optimizer algorithm. PropensityNet solves a binary classification problem, since the treatment variable W is binary. The output, i.e. the last (softmax) layer of the trained network, gives a probability between 0 and 1 for each new/test unit, which is the propensity score. As such, the network can be thought of as a generalization of the logistic regression function. PropensityNet is a fully connected network similar to the figure above, where every neuron in a given layer is connected to every neuron in the next layer. The structure of PropensityNet is given below.

Figure: PropensityNet deep neural network model structure.

As can be seen above, PropensityNet has 5 dense (fully connected) layers. Each layer also has a dropout of 30%, which is a way to avoid overfitting in DNNs. The output layer is a softmax layer, and gives the probability of being in one of the 2 classes (treatment W = 1 or 0), which is the propensity score. There are a total of 382 parameters to be trained in the network. The model was built using Keras with a Tensorflow backend in R.
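
A minimal Keras sketch of a PropensityNet-style model reflecting the description above (five dense layers, 30% dropout after each, a two-class softmax output, categorical cross-entropy and the Adadelta optimizer). The text does not give the layer widths, so the small widths below are assumptions, and the sketch is in Python rather than the R interface mentioned above.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 3   # the two covariates X plus the outcome Y, as described in the text

# Five fully connected layers, each followed by 30% dropout, then a 2-class softmax
# whose second component is the propensity score P(W = 1 | X, Y).
model = keras.Sequential()
model.add(keras.Input(shape=(n_features,)))
for width in (8, 8, 8, 8, 8):          # assumed layer widths
    model.add(layers.Dense(width, activation="relu"))
    model.add(layers.Dropout(0.3))
model.add(layers.Dense(2, activation="softmax"))

model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])

# Training sketch (inputs and one-hot treatment labels assumed to be prepared elsewhere):
# model.fit(XY, W_onehot, validation_split=0.2, epochs=100, batch_size=32)
# propensity_scores = model.predict(XY)[:, 1]
```
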
A simulated dataset was built as follows: data points/units were simulated with 2 covariates drawn from a uniform distribution; the outcome Y was also drawn randomly from a uniform distribution, and all of these units were assigned to treatment W = 1. The unit covariates and outcomes were then jittered to obtain another 1000 units, which were assigned treatment W = 0; thus, we know the ground-truth nearest point/neighbor in W = 1 for each point in W = 0. A logistic regression (logit) model was fit using W ~ X + Y, and PropensityNet was also trained with W as the output and (X, Y) as the inputs for each unit. For both models, we then calculate the assignment error: how far, on average, each test unit is assigned from its ground-truth neighbor, as well as the number of mis-assignments based on the estimated propensity score. PropensityNet gave a smaller number of mis-assignments (6% better), a smaller mean absolute mis-assignment error (12% better, as a percent of the ground-truth index of each unit), and better accuracy (8% better) than the logistic regression model, as shown below.
Table: Error metrics used to compare the proposed PropensityNet (DNN) with the traditional logit model: mean absolute misclassification error (%), number of mis-assignments (%), and accuracy (%).

In the figure below, we plot the control and treatment units matched by PropensityNet, to confirm its good performance visually.

Figure: Plot of a subset of control points (pink), matched (using the PropensityNet output scores) with their treated neighbors (blue). It is clear visually that the points are matched well. The Y-axis is one covariate and the X-axis is the propensity score.

5. Discussion and conclusion

Recently, there have been several efforts to leverage machine learning techniques for causal inference problems, including estimating heterogeneous treatment effects [5], propensity score modeling, and neighbor matching [1] for individual treatment effects. Our aim is to contribute to this continuing effort by adding deep learning techniques to the field of causal inference and econometrics in general. In this paper, we have shown how one can use autoencoders for dimensionality reduction and neighbor matching in feature space. We have also built a deep neural network classifier, PropensityNet, to do propensity score based matching to estimate individual and average treatment effects. The accuracy of both algorithms was verified on simulated datasets. Future work will be to run these algorithms on real-world datasets, as well as to further leverage newer deep learning models for causal inference and econometrics. Code for both algorithms will be made available shortly on Github.

Acknowledgement

We would like to thank Prof. Susan Athey and Prof. Guido Imbens at the Stanford University GSB for several illuminating discussions about causal inference, treatment effects and econometrics.

References

[1] Matching via Dimensionality Reduction for Estimation of Treatment Effects in Digital Marketing Campaigns; Sheng Li, Nikos Vlassis, Jaya Kawale, Yun Fu.
[2] Large sample properties of matching estimators for average treatment effects; Alberto Abadie and Guido W. Imbens, 2001.
[3] The central role of the propensity score in observational studies for causal effects; Paul R. Rosenbaum and Donald B. Rubin.
[4] Estimating causal effects of treatments in randomized and nonrandomized studies; Donald Rubin, 1974.
[5] Recursive Partitioning for Heterogeneous Causal Effects; Susan Athey and Guido W. Imbens, 2015.
[6] Robust Propensity Score Computation Method based on Machine Learning with Label-corrupted Data; Chen Wang, Suzhen Wang, Fuyan Shi, Zaixiang Wang.
[7] Reducing the dimensionality of data with neural networks; G. Hinton and R. Salakhutdinov, 2006.
[8] ImageNet classification with deep convolutional neural networks; A. Krizhevsky, I. Sutskever, G. Hinton, 2012.
