RNN-based counterfactual time-series prediction

Jason Poulos

May 11, 2018

arXiv preprint, v4 [stat.ML], 11 May 2018

Abstract

This paper proposes an alternative to the synthetic control method (SCM) for estimating the effect of a policy intervention on an outcome over time. Recurrent neural networks (RNNs) are used to predict counterfactual time-series of treated unit outcomes using only the outcomes of control units as inputs. Unlike SCM, the proposed method does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, can handle multiple treated units, and can share model parameters across time-steps. RNNs outperform SCM in terms of recovering experimental estimates from a field experiment extended to a time-series observational setting. In placebo tests run on three different benchmark datasets, RNNs are more accurate than SCM in predicting the post-intervention time-series of control units, while yielding a comparable proportion of false positives. The proposed method contributes to a new literature that uses machine learning techniques for data-driven counterfactual prediction.

Graduate Student, Department of Political Science, University of California, Berkeley. Contact: poulos@berkeley.edu.

1 Introduction

An important problem in the social sciences is estimating the effect of a policy intervention on an outcome over time. When interventions take place at an aggregate level (e.g., city or state), researchers make causal inferences by comparing the post-intervention outcomes of affected ("treated") units against the outcomes of unaffected units ("controls"). The synthetic control method (SCM) (Abadie, Diamond, and Hainmueller 2010) is a popular method for making causal inferences on observational time-series. The method compares a single treated unit outcome with a synthetic control that combines the outcomes of multiple control units on the basis of their pre-intervention similarity with the treated unit. Specifically, the synthetic control is constructed by choosing a convex combination of weights $w = (w_1, \ldots, w_J)$, $w_j \geq 0$, $\sum_j w_j = 1$, of control time-series that minimizes $\lVert X_1 - X_0 w \rVert_v$, where $X_1$ and $X_0$ are the pre-intervention covariates of the treated and control units, respectively, and $v = (v_1, \ldots, v_J)$ are importance weights chosen to minimize the prediction error produced by the resulting $w$ during cross-validation.

The SCM has several limitations. First, the formation of the synthetic control is restricted to a convex combination of controls. Ferman and Pinto (2016) point out that this restriction implies that the SCM estimator may be biased even if selection into treatment is only correlated with time-invariant unobserved covariates. Second, the specification of the estimator can produce very different results. Ferman, Pinto, and Possebom (2018) show how cherry-picking between common SCM specifications can facilitate p-hacking.¹ Third, SCM cannot handle multiple treated units.

This paper proposes an alternative to SCM that does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, and can handle multiple treated units. The method uses recurrent neural networks (RNNs) for counterfactual time-series prediction, using only the outcomes of control units as model inputs.

¹ Several problems arise from the lack of guidance on how to specify the SCM estimator. Kaul et al. (2015) show, for instance, that the common practice of including lagged versions of the outcome variable as separate predictors can render all other covariates irrelevant. Klößner et al. (2017) show that cross-validation can yield multiple values of importance weights, and consequently different results.

RNNs are a class of neural networks that take advantage of the sequential nature of time-series data by sharing model parameters across multiple time-steps (El Hihi and Bengio 1995). RNNs are capable of not only handling multiple treated units, but also sharing learned model weights when predicting the outcomes of multiple treated units. RNNs that are sufficiently deep can learn the most useful nonconvex combination of control unit outcomes at each time-step for generating counterfactual predictions. RNNs have been shown to outperform various linear models on time-series prediction tasks (Cinar et al. 2017). The proposed method builds on a new literature that uses machine learning techniques such as matrix completion (Athey et al. 2017) and the Lasso (Doudchenko and Imbens 2016) for data-driven counterfactual prediction.

Results from empirical applications demonstrate that RNNs outperform SCM in terms of recovering experimental estimates from an actual field experiment extended to a time-series observational setting. RNNs are also more accurate than SCM in predicting the post-intervention time-series of control units in a series of placebo tests.

In Section 2, I describe the approach of using RNNs for counterfactual time-series prediction and detail the procedure for evaluating the models in terms of predictive accuracy and statistical significance; Sections 3 and 4 evaluate the models in empirical applications; and Section 5 discusses when the proposed method is expected to outperform SCM.

2 RNNs for counterfactual prediction

The proposed method estimates the causal effect of a discrete intervention in observational time-series data; i.e., settings in which treatment is not randomly assigned and there exist both pre- and post-intervention observations of the outcome of interest. Brodersen et al. (2015) originally propose an alternative to SCM that uses the pre-period time-series of control units to train a model to predict the counterfactual time-series of the treated unit.

The key assumption of this approach is that the relationship between predictors $x^{(1)}, \ldots, x^{(\tau)}$ and the treated time-series $y^{(1)}, \ldots, y^{(\tau)}$ modeled prior to the intervention persists after the intervention. The approach also assumes that the control units are not themselves affected by the intervention; i.e., there is no spillover effect that would indeterminately bias the estimate and possibly lead to a false positive. In its most basic form, the pre-period relationship can be modeled by linear regression,

$$y^{(t)} = \alpha_0 + \beta x^{(t)} + \epsilon^{(t)}, \quad t = 1, \ldots, n,$$

where time-step $t = 1, \ldots, n, \ldots, \tau$ is the temporal ordering of the time-series and $n$ denotes the end of the pre-period. As long as $x^{(t)}$ was not impacted by the intervention, it is plausible that the modeled relationship persists after the intervention. The fitted model is then used to predict the counterfactual time-series of the treated group in the post-period:

$$\hat{y}^{(t)} = \alpha_1 + \beta x^{(t)} + \zeta^{(t)}, \quad t = n+1, \ldots, \tau.$$

The inferred treatment effect is the difference between the observed time-series of the treated units and the counterfactual time-series that would have been observed in the absence of the intervention:

$$\hat{\phi}^{(t)} = y^{(t)} - \hat{y}^{(t)}, \quad t = n+1, \ldots, \tau. \tag{1}$$

2.1 Predictive accuracy and statistical significance

In applications in which the counterfactual time-series is known (i.e., placebo tests), models can be evaluated in terms of the mean squared prediction error (MSPE) between the predicted and actual post-intervention time-series among control units. Specifically, I calculate

$$\mathrm{MSPE} = \frac{1}{\tau - n} \sum_{t=n+1}^{\tau} \big(\hat{\phi}^{(t)}\big)^2. \tag{2}$$
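To make this workflow concrete, below is a minimal sketch of the linear baseline in Eqs. 1-2: fit the pre-period relationship between control-unit outcomes and a treated outcome, predict the post-period counterfactual, and compute the point-wise effects and their MSPE. This is an illustration on simulated data, not the paper's code; the array names and dimensions are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: X_controls is (tau, J) control-unit outcomes,
# Y_treated is (tau,) treated-unit outcomes, n is the last pre-period index.
rng = np.random.default_rng(0)
tau, J, n = 40, 5, 30
X_controls = rng.normal(size=(tau, J))
Y_treated = X_controls @ rng.normal(size=J) + rng.normal(scale=0.1, size=tau)

# Fit the pre-period relationship (the linear analogue of the RNN models).
model = LinearRegression().fit(X_controls[:n], Y_treated[:n])

# Predict the post-period counterfactual and form the effect estimates (Eq. 1).
y_hat_post = model.predict(X_controls[n:])
phi_hat = Y_treated[n:] - y_hat_post

# MSPE over the post-period (Eq. 2); near zero here because no intervention occurred.
mspe = np.mean(phi_hat ** 2)
print(round(mspe, 4))
```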

The standard deviation $\eta$ of the MSPE distribution over all control units is used to form corresponding intervals, $[\mathrm{MSPE} - z_{\alpha/2}\,\eta,\ \mathrm{MSPE} + z_{\alpha/2}\,\eta]$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.

Eq. 2 measures the accuracy of the counterfactual predictions, and consequently the accuracy of the estimated treatment effect. However, this metric does not tell us anything about the statistical significance of estimated treatment effects. ADH (2010) propose a randomization inference approach for calculating the exact distribution of placebo effects under the null hypothesis. Following Cavallo et al. (2013), I extend this method to the case of multiple (placebo) treated units by constructing a distribution of average placebo effects following the procedure described in OA-Section 1. A single false positive rate is calculated as $\mathrm{FPR} = \mathrm{FP} / [(\tau - n) \cdot J]$, where FP is the number of false positives, defined as the number of the $(\tau - n) \cdot J$ p-values that are less than or equal to the significance level $\alpha$.

2.2 RNNs

RNNs consist of a hidden state $h^{(t)}$ and an output $y^{(t)}$ which operate on a sequence $x^{(t)}$. At each time-step $t$, RNNs input $x^{(t)}$ and pass it to the hidden state, which is updated with a function $g^{(t)}$ using the entire history of the sequence (Goodfellow, Bengio, and Courville 2016, 337):

$$h^{(t)} = g^{(t)}\big(x^{(1)}, \ldots, x^{(t)}\big) \tag{3}$$
$$h^{(t)} = f\big(h^{(t-1)}, x^{(t)}\big), \tag{4}$$

where $f(\cdot)$ is a nonlinear function that operates on all time-steps and input lengths. The updated hidden layer is used to generate a sequence of output values $o^{(t)}$ in the form of log probabilities that correspond to $x^{(t)}$. The loss function computes $\hat{y}^{(t)} = \mathrm{linear}(o^{(t)})$ and compares this value to $y^{(t)}$. In the empirical applications, I calculate the loss in terms of mean squared prediction error, $L^{(t)}_{\mathrm{MSPE}} = E\big[(y^{(t)} - \hat{y}^{(t)})^2\big]$.

2.3 Encoder-decoder networks

A special variant of RNNs suitable for handling variable-length sequential data is the encoder-decoder network (Cho et al. 2014). Encoder-decoder networks are the standard for neural machine translation (Bahdanau, Cho, and Bengio 2014; Vinyals et al. 2014) and are also widely used for predictive tasks, including speech recognition (Chorowski et al. 2015) and time-series forecasting (Zhu and Laptev 2017). Encoder-decoder networks are trained to estimate the conditional distribution of the output sequence given the past input sequence, e.g., $p(y^{(1)}, \ldots, y^{(t)} \mid x^{(1)}, \ldots, x^{(t)})$, where the input and output sequence lengths can differ.

The encoder RNN reads in $x^{(t)}$ sequentially and the hidden state of the network updates according to Eq. 3. The hidden state of the encoder is a context vector $c$ that summarizes the input sequence, which is copied over to the decoder RNN. The decoder generates a variable-length output sequence by predicting $y^{(t)}$ given the encoder hidden state and the previous element of the output sequence. Thus, the hidden state of the decoder is updated recursively by

$$h^{(t)} = f\big(h^{(t-1)}, y^{(t-1)}, c\big), \tag{5}$$

and the conditional probability of the next element of the sequence is

$$P\big(y^{(t)} \mid y^{(1)}, \ldots, y^{(t-1)}, c\big) = f\big(h^{(t)}, y^{(t-1)}, c\big).$$

Effectively, the decoder learns to generate outputs $y^{(t)}$ given the previous outputs, conditioned on the input sequence.
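As a concrete illustration of the encoder-decoder structure sketched in Eq. 5, the following is a minimal PyTorch example; it is not the paper's implementation (the architectures actually used are described in the Online Appendix). The interface, in which the encoder reads control-unit outcomes and the decoder is conditioned on lagged treated-unit outcomes, and all names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal LSTM encoder-decoder for multivariate time-series prediction."""

    def __init__(self, n_controls, n_treated, hidden_size=32):
        super().__init__()
        self.encoder = nn.LSTM(n_controls, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(n_treated, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, n_treated)

    def forward(self, x_controls, y_lagged):
        # x_controls: (batch, T_in, n_controls) -- control-unit outcomes.
        # y_lagged:   (batch, T_out, n_treated) -- previous treated outcomes fed to the decoder.
        _, state = self.encoder(x_controls)      # context summarizing the input sequence
        out, _ = self.decoder(y_lagged, state)   # decoder conditioned on the encoder state
        return self.readout(out)                 # predicted treated-unit outcomes

# Example forward pass with random data.
model = EncoderDecoder(n_controls=16, n_treated=1)
x = torch.randn(8, 20, 16)      # 8 sequences, 20 pre-period steps, 16 control units
y_prev = torch.randn(8, 5, 1)   # lagged treated outcomes over 5 prediction steps
y_hat = model(x, y_prev)        # shape: (8, 5, 1)
```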

3 Application: Mayoral elections field experiment

Panagopoulos and Green (2008a) (hereafter, "PG") randomly assigned get-out-the-vote radio ads to cities holding mayoral elections in 2005 and 2006. Random assignment occurred within matched pairs, with treated cities exposed to 50, 70, or 90 gross ratings points (GRPs) of radio advertising, and control cities exposed to 0 GRPs. The study investigates whether radio ads increase electoral competitiveness, defined as the percentage difference between the first- and second-place candidates. The authors estimate a negative effect of ad spending on vote margin, which is consistent with the hypothesis that ads increase electoral competitiveness. However, the treatment effect is only significant when estimated with a linear model that includes covariates such as past turnout.

I extend the experimental data to an observational time-series setting by assembling a dataset of city-year observations of winner margin in mayoral elections. The primary source of data is the Mayoral Elections Survey Data (Ferreira and Gyourko 2017), which are described in Ferreira and Gyourko (2009). These data include winner, runner-up, and total vote counts used to construct the winner margin variable for 6,211 elections in 842 cities. The second data source is the replication data (Hopkins 2016) for Hopkins and Pettingill's (2017) study, which consists of 341 elections in 115 cities. Finally, I include winner margin data from the 74 experimental cities observed in 2005 and 2006 (P&G 2008b).² The assembled (unbalanced) panel dataset consists of winner margins from 6,578 elections in 934 cities, spanning 1945 to 2006.

² When datasets have overlapping city-year observations of winner margin, I choose the dataset of Ferreira and Gyourko (2017) because it has the best coverage. When a city has multiple elections in a year (e.g., to fill a vacancy), I take the mean across elections.

3.1 Experimental estimates

Experimental estimates are obtained by estimating the following geo-based regression (GBR) model, proposed by Vaver and Koehler (2011) for evaluating location-targeted advertising experiments:

$$y_{s,1} = \beta_0 + \beta_1 y_{s,0} + \beta_2 \tau_s + \beta_3\, \mathrm{strata}_s + \mu_s, \tag{6}$$

where $y_{s,1}$ and $y_{s,0}$ are the log of mean vote margin during the post- and pre-intervention periods for city $s$, respectively; $\tau_s$ is the post-period level of ad spending in GRPs; and $\mu_s$ is the error term. $\mathrm{strata}_s$ is a vector of price strata dummies that control for the fact that the random assignment of ad spending was conducted within price strata. Since $\tau_s$ is randomly assigned conditional on price strata, $\hat{\beta}_2$ captures the effect of ad spending on vote margin.

Eq. 6 is estimated on balanced panels of PG experimental cities that have both pre- and post-period vote margin observations, using weighted least squares with weights $w_s = 1/y_{s,0}$ to control for heteroscedasticity due to differences in city size.³ The standard errors for $\hat{\beta}_2$ are estimated with a non-parametric bootstrap stratified by price strata.

³ Following the authors' specifications, I include a year dummy in the pooled sample regression. In contrast to the authors' specifications, I do not restrict the samples to cities with incumbents running against at least one opposing candidate, in order to preserve sample sizes.

The experimental estimates and corresponding 95% confidence intervals are presented in Panel A of Table 1. The log-level regression estimate on the pooled panel containing both 2005 and 2006 elections implies that each one-point GRP purchase increases the winner's margin by 0.05% [-0.8%, 1%]. In comparison, the pooled treatment effect estimate in PG is -0.11% [-0.26%, 0.03%].

3.2 Observational estimates

In this application, $y^{(t)}$ is a multivariate time-series that takes the value of winner margins in PG treated cities, and $x^{(t)}$ is a multivariate time-series of $J = 778$ predictors consisting of both PG control cities and other non-treated cities, with missing predictor values imputed via linear interpolation. There are 47 pre-period time-steps (1948 to 2004) and two post-period time-steps (2005 and 2006). In this application, RNNs are trained on five input-output sequence pairs that are offset by one time-step into the future, with the last pair reserved for model validation.
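To illustrate how input-output sequence pairs offset by one time-step can be constructed from a pre-period panel, a simple sketch follows; the window length, array names, and toy dimensions are hypothetical, and this is not the paper's code.

```python
import numpy as np

def make_offset_pairs(controls, treated, n_pairs=5, window=10):
    """Build (input, target) sequence pairs offset by one time-step.

    controls: (T, J) array of control-unit outcomes (model inputs).
    treated:  (T, K) array of treated-unit outcomes (prediction targets).
    Returns stacked input windows and target windows shifted one step ahead.
    """
    inputs, targets = [], []
    for start in range(n_pairs):
        inputs.append(controls[start:start + window])           # x^(t), ..., x^(t+window-1)
        targets.append(treated[start + 1:start + window + 1])   # y^(t+1), ..., y^(t+window)
    return np.stack(inputs), np.stack(targets)

# Toy example: 47 pre-period steps, 778 control series, 5 treated series.
controls = np.random.rand(47, 778)
treated = np.random.rand(47, 5)
X, Y = make_offset_pairs(controls, treated)
X_train, Y_train = X[:-1], Y[:-1]   # earlier pairs used for training
X_val, Y_val = X[-1:], Y[-1:]       # last pair reserved for validation
```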

I train both encoder-decoder networks and a baseline RNN in the form of a single Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) on the observational panel dataset. In addition to using regularization techniques such as dropout and L2 loss, I check-point each RNN model throughout its training history and use the network weights with the lowest validation error to generate counterfactual predictions.

The treatment effect estimates on observational data are calculated by Eq. 1 and presented in Panel B of Table 1. With respect to the MSPE between predicted and experimental estimates of the treatment effects, encoder-decoder networks outperform the synthetic control, which is constructed using winner margin means over the pre-period. The pooled encoder-decoder estimate implies that each one-point GRP purchase increases the winner's margin by 4% [2%, 6%], which overestimates the true experimental effect and leads to a false positive. LSTM and SCM also overestimate the true experimental effect, although the corresponding confidence intervals are wide enough that the estimates do not yield false positives. The LSTM performs particularly poorly on this dataset because the network is too shallow to take advantage of the high-dimensional set of predictors.

Table 1: Comparison of experimental and observational estimates of the causal impact of the advertising campaign on log winner margin in PG treated cities; 95% intervals for φ̂ (2005), φ̂ (2006), and φ̂ (pooled) are shown in brackets.

Panel A: Experimental
GBR: [-0.01, 0.01]; [-0.02, 0.02]; [-0.008, 0.01]

Panel B: Observational
Encoder-decoder: [0.009, 0.05]; [0.04, 0.07]; [0.02, 0.06]
LSTM: [-4.69, 7.28]; [-7.52, 9.98]; [-6.11, 8.63]
SCM: [-0.84, 0.69]; [-0.91, 0.69]; [-0.88, 0.69]

Experimental estimates are obtained by estimating Eq. 6 via weighted least squares on a balanced pre-/post-period sample restricted to experimental cities (N = {31, 17, 48}). For experimental estimates, values in brackets represent 95% bootstrap confidence intervals constructed with 1,000 stratified bootstrap samples. For observational estimates, 95% randomization confidence intervals are calculated following the procedure described in Section OA-1.1. The shaded cells indicate the model with the lowest MSPE between the estimated treatment effect on the observational data and the corresponding experimental estimate.
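A rough sketch of the check-pointing strategy described above, keeping the network weights with the lowest validation error, is given below. The model, data tensors, and hyperparameters are placeholders, and this is a generic PyTorch training loop rather than the paper's implementation; the `weight_decay` argument stands in for an L2 penalty.

```python
import copy
import torch
import torch.nn as nn

def train_with_checkpointing(model, X_train, Y_train, X_val, Y_val,
                             epochs=200, lr=1e-3, weight_decay=1e-4):
    """Train a single-input model and keep the weights with the lowest validation MSPE."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    best_val, best_state = float("inf"), None

    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), Y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), Y_val).item()
        if val_loss < best_val:                      # check-point the best weights so far
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)                # restore the lowest-validation-error weights
    return model, best_val
```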

4 Application: SCM placebo tests

In a second set of analyses, I evaluate the proposed RNN-based approach on three datasets common to the SCM literature. In each dataset, I remove the actual treated unit and evaluate predictive accuracy on the control units. The treatment effects on the control units are expected to be about zero, which would yield a low MSPE.

4.1 Basque Country

Abadie and Gardeazabal (2003) estimate the economic impact of terrorism in the Basque Country during the period of 1968 to 1997 by comparing per-capita gross domestic product (GDP) in the Basque Country against a synthetic control region without terrorism. The synthetic control is constructed using pre-period measures of illiteracy, educational attainment, investment, and means of GDP. The time series begins in 1955, which leaves only n = 14 pre-period time-steps.

Fig. OA-5 plots estimated treatment effects on control units for each model. We observe considerable variability in the SCM estimates and not as much variability in the RNNs. This is explained by the fact that SCM cannot handle multiple (placebo) treated units, and thus a separate model has to be run for each (placebo) treated unit. RNNs can handle multiple treated units and thus benefit from parameter sharing across treated units. The standard deviation from the mean MSPE, which is reported in the first column of Panel A of Table 2, indicates that SCM is comparatively more variable in terms of its predictive accuracy. The baseline LSTM yields the lowest mean MSPE.

Fig. OA-6 plots the per-period randomization p-values corresponding to treatment effects on treated and control units. The p-values corresponding to treatment effects on the actual treated unit (i.e., the Basque Country) are computed by comparing the per-period treatment effects on the Basque Country against the null distribution of average placebo effects, following the procedure outlined in Section OA-1.1.
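The placebo evaluation in this section can be summarized by the sketch below, in which each control unit is treated in turn as a placebo-treated unit, its counterfactual is predicted from the remaining controls, and the post-period MSPE is recorded. This is an illustrative reading of the design, not the paper's code; `fit_and_predict` is a placeholder that stands in for any of the models (SCM, LSTM, or encoder-decoder), and the toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_predict(X_pre, y_pre, X_post):
    """Placeholder counterfactual model: a linear fit on the pre-period."""
    return LinearRegression().fit(X_pre, y_pre).predict(X_post)

def placebo_mspe(panel, n):
    """panel: (T, J) outcomes for J control units; n: last pre-period index.

    Returns one post-period MSPE per placebo-treated control unit."""
    T, J = panel.shape
    mspes = []
    for j in range(J):
        y = panel[:, j]                                # placebo-treated unit
        X = np.delete(panel, j, axis=1)                # remaining controls as predictors
        y_hat = fit_and_predict(X[:n], y[:n], X[n:])
        mspes.append(np.mean((y[n:] - y_hat) ** 2))    # Eq. 2 for this unit
    return np.array(mspes)

# Toy example with no intervention: MSPEs should be small across control units.
rng = np.random.default_rng(1)
panel = np.cumsum(rng.normal(size=(30, 16)), axis=0)
scores = placebo_mspe(panel, n=14)
print(scores.mean(), scores.std())
```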

SCM has the comparatively lowest FPR among all models (Panel B of Table 2): the model falsely rejects the null hypothesis about a quarter of the time.⁴

⁴ Note that the p-values are not adjusted for multiple comparisons.

Table 2: Evaluation metrics on SCM placebo tests for encoder-decoder networks, LSTM, and SCM on the Basque Country, California, Elections, and West Germany datasets. Panel A compares error rates for predicting the post-period time-series of control units (mean MSPE ± one standard deviation). Panel B reports the corresponding FPRs.

4.2 California

ADH (2010) apply SCM to estimate the effects of a large-scale tobacco control program implemented in California in 1988. The study spans the period of 1970 to 2000, providing n = 19 pre-period time-steps. The synthetic control is constructed using pre-period covariates including income, beer sales, demographics, and means of the dependent variable, which is log per-capita cigarette consumption.

Fig. OA-8 shows that for RNNs, control unit treatment effects are tightly centered around zero as expected, whereas SCM control treatment effects are more dispersed. Encoder-decoder networks yield the lowest mean MSPE (± 0.004), while yielding a comparatively higher FPR.

4.3 West Germany

Lastly, ADH (2015) construct a synthetic West Germany in order to estimate the impact of the 1990 German reunification on log real per-capita GDP during a post-period that extends to 2003. The time-series begins in 1960, which leaves n = 30 pre-period time-steps. The synthetic control is constructed using pre-period means of GDP, trade, industry, schooling, and investment.

SCM and RNN-based approaches both assume the absence of spillover effects. ADH (2015) acknowledge that spillover effects are a valid concern in their study because German reunification likely had effects on GDP in the 16 OECD member countries that serve as controls. Indeed, Fig. OA-11 shows that RNNs estimate mostly positive (and increasing) treatment effects on control units, which suggests that the estimate of the treatment effect on West Germany is biased upward. Overall, the baseline LSTM yields the lowest error in terms of mean MSPE, 0.02 ± 0.01, with an FPR comparable to SCM.

5 Discussion

This paper proposes a novel alternative to SCM, which is growing in popularity in the social sciences despite its limitations, the most obvious being that the choice of specification can lead to different results and thus facilitate p-hacking. Since RNNs input only control unit outcomes, and do not rely on pre-period covariates, the proposed method offers a more principled approach than SCM. RNNs are also capable of handling multiple treated units and can learn nonconvex combinations of control units. The former attribute is useful because the model can share parameters across treated units, and thus generate more precise predictions in settings in which treated units share similar data-generating processes. The latter attribute is beneficial when the data-generating process underlying the outcome of interest depends nonlinearly on the history of its inputs.

In the first empirical application, I extend experimental data from a field experiment that investigates the effect of randomized radio ads on electoral competition to an observational time-series setting. I find that encoder-decoder networks outperform both LSTM and SCM in recovering the ground-truth experimental estimate, although their comparatively narrow confidence intervals yield a false positive. The LSTM and SCM also overestimate the true experimental estimate, but have wider confidence intervals that contain zero.

In a second application, I run placebo tests on control units from three datasets common to the SCM literature. The models are evaluated on their ability to produce low error rates on control units (i.e., to estimate treatment effects of zero) without yielding false positives. I find that either encoder-decoder networks or the LSTM outperform SCM on each of the three datasets in terms of having the lowest MSPE, with FPRs comparable to SCM.

The baseline LSTM performs well in all applications except for the mayoral elections application, and outperforms encoder-decoder networks in two of the datasets. This is likely due to the LSTM being too shallow to learn useful features in datasets with high-dimensional predictor sets, which is the case for the mayoral elections data. When applied to datasets with low-dimensional predictor sets, the LSTM can perform well, but deeper networks such as encoder-decoder networks are susceptible to overfitting. Overfitting in this case means that the networks learn dependencies on a small subset of predictors and cannot generalize well to unseen data. Overfitting occurs when training encoder-decoder networks on the Basque Country dataset, which has the lowest dimensions of the three SCM datasets (Fig. 2a). Even in this case of obvious overfitting, model check-pointing is employed so that the model with the lowest validation error is used to produce counterfactual time-series.

The results suggest that RNNs should outperform SCM in all cases as long as the complexity of the network architecture is proportional to the dimension of the predictor set. Encoder-decoder networks outperform the other models when the predictor set is comparatively large (i.e., J = 38 in the California dataset and J = 778 in the elections dataset), while the baseline LSTM outperforms all other models on smaller predictor sets (J = 16 for the Basque Country and West Germany datasets).

Future work can explore predicting counterfactual time-series using only the treated unit outcomes as model inputs. As the study of German reunification demonstrates, appropriate control units are not always available, either because they are fundamentally different from treated units or due to spillover effects.

Supplementary materials

The Online Appendix (OA) contains a description of RNN architecture and implementation, and plots of RNN training history. For each application and model, it contains plots of the observed and predicted outcomes, post-period treatment effects, and p-values. The OA can be accessed at jvpoulos.github.io/papers/rnns-causal-oa.pdf.

Funding

This work was supported by the National Science Foundation Graduate Research Fellowship [grant number DGE ]. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.

Acknowledgements

Jason Anastasopoulos suggested the idea of using a machine learning algorithm as an alternative to SCM. The paper benefited from discussions with Rafael Valle.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105 (490).

———. 2015. "Comparative Politics and the Synthetic Control Method." American Journal of Political Science 59 (2).

Abadie, Alberto, and Javier Gardeazabal. 2003. "The Economic Costs of Conflict: A Case Study of the Basque Country." The American Economic Review 93 (1).

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2017. "Matrix Completion Methods for Causal Panel Data Models." arXiv preprint.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint.

Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. 2015. "Inferring Causal Impact Using Bayesian Structural Time-series Models." The Annals of Applied Statistics 9 (1).

Cavallo, Eduardo, Sebastian Galiani, Ilan Noy, and Juan Pantano. 2013. "Catastrophic Natural Disasters and Economic Growth." Review of Economics and Statistics 95 (5).

Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation." arXiv preprint.

Chorowski, Jan K., Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. "Attention-based Models for Speech Recognition." In Advances in Neural Information Processing Systems.

Cinar, Yagmur Gizem, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, Ali Aït-Bachir, and Vadim Strijov. 2017. "Position-Based Content Attention for Time Series Forecasting with Sequence-to-Sequence RNNs." In International Conference on Neural Information Processing. Springer.

Doudchenko, Nikolay, and Guido W. Imbens. 2016. "Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis." Technical report. National Bureau of Economic Research.

El Hihi, Salah, and Yoshua Bengio. 1995. "Hierarchical Recurrent Neural Networks for Long-Term Dependencies." In Neural Information Processing Systems.

Ferman, Bruno, and Cristine Pinto. 2016. "Revisiting the Synthetic Control Estimator." Working paper.

Ferman, Bruno, Cristine Pinto, and Vitor Possebom. 2018. "Cherry Picking with Synthetic Controls." Working paper.

Ferreira, Fernando, and Joseph Gyourko. 2009. "Do Political Parties Matter? Evidence from US Cities." The Quarterly Journal of Economics 124 (1).

———. 2017. "Mayoral Elections Survey Data." Received privately from Fernando Ferreira on Oct. 25.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, MA: MIT Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. "Long Short-Term Memory." Neural Computation 9 (8).

Hopkins, Daniel. 2016. "Retrospective Voting in Big-City U.S. Mayoral Elections." dx.doi.org/ /dvn/kfikh8. Accessed Oct. 25.

Hopkins, Daniel J., and Lindsay M. Pettingill. 2017. "Retrospective Voting in Big-City US Mayoral Elections." Political Science Research and Methods.

Kaul, Ashok, Stefan Klößner, Gregor Pfeifer, and Manuel Schieler. 2015. "Synthetic Control Methods: Never Use All Pre-Intervention Outcomes Together With Covariates." Working paper.

Klößner, Stefan, Ashok Kaul, Gregor Pfeifer, and Manuel Schieler. 2017. "Comparative Politics and the Synthetic Control Method Revisited: A Note on Abadie et al. (2015)." Swiss Journal of Economics and Statistics.

Panagopoulos, Costas, and Donald P. Green. 2008a. "Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." American Journal of Political Science 52 (1).

———. 2008b. "Replication Materials for: Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." Accessed Oct. 25.

Vaver, Jon, and Jim Koehler. 2011. "Measuring Ad Effectiveness Using Geo Experiments." Technical report. Google Inc. http://googleresearch.blogspot.com/2011/12/measuring-ad-effectiveness-using-geo.html.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. "Grammar as a Foreign Language." arXiv preprint.

Zhu, Lingxue, and Nikolay Laptev. 2017. "Deep and Confident Prediction for Time Series at Uber." arXiv preprint.


More information

Implicitly-Defined Neural Networks for Sequence Labeling

Implicitly-Defined Neural Networks for Sequence Labeling Implicitly-Defined Neural Networks for Sequence Labeling Michaeel Kazi MIT Lincoln Laboratory 244 Wood St, Lexington, MA, 02420, USA michaeel.kazi@ll.mit.edu Abstract We relax the causality assumption

More information

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, 2015 @ Deep Lunch 1 Why LSTM? OJen used in many recent RNN- based systems Machine transla;on Program execu;on Can

More information

arxiv: v1 [econ.em] 18 Nov 2017

arxiv: v1 [econ.em] 18 Nov 2017 Robust Synthetic Control Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen arxiv:1711.06940v1 [econ.em] 18 Nov 2017 Laboratory for Information and Decision Systems, Statistics and Data Science Center,

More information

Diversifying Neural Conversation Model with Maximal Marginal Relevance

Diversifying Neural Conversation Model with Maximal Marginal Relevance Diversifying Neural Conversation Model with Maximal Marginal Relevance Yiping Song, 1 Zhiliang Tian, 2 Dongyan Zhao, 2 Ming Zhang, 1 Rui Yan 2 Institute of Network Computing and Information Systems, Peking

More information

arxiv: v1 [econ.em] 19 Feb 2019

arxiv: v1 [econ.em] 19 Feb 2019 Estimation and Inference for Synthetic Control Methods with Spillover Effects arxiv:1902.07343v1 [econ.em] 19 Feb 2019 Jianfei Cao Connor Dowd February 21, 2019 Abstract he synthetic control method is

More information

A Machine Learning Approach for Virtual Flow Metering and Forecasting

A Machine Learning Approach for Virtual Flow Metering and Forecasting To appear in Proc. of 3rd IFAC Workshop on Automatic Control in Offshore Oil and Gas Production, Esbjerg, Denmark, May 30 June 01, 2018. A Machine Learning Approach for Virtual Flow Metering and Forecasting

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information

Supporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil

Supporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil Supporting Information Controlling the Airwaves: Incumbency Advantage and Community Radio in Brazil Taylor C. Boas F. Daniel Hidalgo May 7, 2011 1 1 Descriptive Statistics Descriptive statistics for the

More information

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS 2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS Simone Scardapane,

More information

Review Networks for Caption Generation

Review Networks for Caption Generation Review Networks for Caption Generation Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen School of Computer Science Carnegie Mellon University {zhiliny,yey1,yuexinw,rsalakhu,wcohen}@cs.cmu.edu

More information