RNN-based counterfactual time-series prediction

Jason Poulos

May 11, 2018

arXiv preprint, v4 [stat.ML], 11 May 2018

Abstract

This paper proposes an alternative to the synthetic control method (SCM) for estimating the effect of a policy intervention on an outcome over time. Recurrent neural networks (RNNs) are used to predict counterfactual time-series of treated unit outcomes using only the outcomes of control units as inputs. Unlike SCM, the proposed method does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, can handle multiple treated units, and can share model parameters across time-steps. RNNs outperform SCM in terms of recovering experimental estimates from a field experiment extended to a time-series observational setting. In placebo tests run on three different benchmark datasets, RNNs are more accurate than SCM in predicting the post-intervention time-series of control units, while yielding a comparable proportion of false positives. The proposed method contributes to a new literature that uses machine learning techniques for data-driven counterfactual prediction.

Graduate Student, Department of Political Science, University of California, Berkeley. Contact: poulos@berkeley.edu.

1 Introduction

An important problem in the social sciences is estimating the effect of a policy intervention on an outcome over time. When interventions take place at an aggregate level (e.g., city or state), researchers make causal inferences by comparing the post-intervention outcomes of affected ("treated") units against the outcomes of unaffected units ("controls"). The synthetic control method (SCM) (Abadie, Diamond, and Hainmueller 2010) is a popular method for making causal inferences on observational time-series. The method compares a single treated unit outcome with a synthetic control that combines the outcomes of multiple control units on the basis of their pre-intervention similarity with the treated unit. Specifically, the synthetic control is constructed by choosing a convex combination of weights $w = (w_1, \ldots, w_J)$, $w_j \geq 0$, $\sum_j w_j = 1$, of control time-series that minimizes $\lVert X_1 - X_0 w \rVert_v$, where $X_1$ and $X_0$ are the pre-intervention covariates of the treated and control units, respectively, and $v = (v_1, \ldots, v_J)$ are importance weights chosen to minimize the prediction error produced by the resulting $w$ during cross-validation.

The SCM has several limitations. First, the formation of the synthetic control is restricted to a convex combination of controls. Ferman and Pinto (2016) point out that this restriction implies that the SCM estimator may be biased even if selection into treatment is only correlated with time-invariant unobserved covariates. Second, the specification of the estimator can produce very different results. Ferman, Pinto, and Possebom (2018) show how cherry-picking between common SCM specifications can facilitate p-hacking.¹ Third, SCM cannot handle multiple treated units.

This paper proposes an alternative to SCM that does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, and can handle multiple treated units. The method uses recurrent neural networks (RNNs) for counterfactual time-series prediction, using only the outcomes of control units as model inputs.

¹ Several problems arise from the lack of guidance on how to specify the SCM estimator. Kaul et al. (2015) show, for instance, that the common practice of including lagged versions of the outcome variable as separate predictors can render all other covariates irrelevant. Klößner et al. (2017) show that cross-validation can yield multiple values of importance weights, and consequently different results.

RNNs are a class of neural networks that take advantage of the sequential nature of time-series data by sharing model parameters across multiple time-steps (El Hihi and Bengio 1995). RNNs are capable of not only handling multiple treated units, but also sharing learned model weights when predicting the outcomes of multiple treated units. RNNs that are sufficiently deep can learn the most useful nonconvex combination of control unit outcomes at each time-step for generating counterfactual predictions. RNNs have been shown to outperform various linear models on time-series prediction tasks (Cinar et al. 2017). The proposed method builds on a new literature that uses machine learning techniques such as matrix completion (Athey et al. 2017) and the Lasso (Doudchenko and Imbens 2016) for data-driven counterfactual prediction.

Results from empirical applications demonstrate that RNNs outperform SCM in terms of recovering experimental estimates from an actual field experiment extended to a time-series observational setting. RNNs are also more accurate than SCM in predicting the post-intervention time-series of control units in a series of placebo tests.

In Section 2, I describe the approach of using RNNs for counterfactual time-series prediction and detail the procedure for evaluating the models in terms of predictive accuracy and statistical significance; Sections 3 and 4 evaluate the models in empirical applications; and Section 5 discusses when the proposed method is expected to outperform SCM.

2 RNNs for counterfactual prediction

The proposed method estimates the causal effect of a discrete intervention in observational time-series data; i.e., settings in which treatment is not randomly assigned and there exist both pre- and post-intervention observations of the outcome of interest. Brodersen et al. (2015) originally propose an alternative to SCM that uses the pre-period time-series of control units to train a model to predict the counterfactual time-series of the treated unit.

The key assumption of this approach is that the relationship between predictors $x^{(1)}, \ldots, x^{(\tau)}$ and the treated time-series $y^{(1)}, \ldots, y^{(\tau)}$ modeled prior to the intervention persists after the intervention. The approach also assumes that the control units are not themselves affected by the intervention; i.e., there is no spillover effect that would indeterminately bias the estimate and possibly lead to a false positive. In its most basic form, the pre-period relationship can be modeled by linear regression,

$$y^{(t)} = \alpha_0 + \beta x^{(t)} + \epsilon^{(t)}, \quad t = 1, \ldots, n,$$

where time-step $t = 1, \ldots, n, \ldots, \tau$ is the temporal ordering of the time-series and $n$ denotes the end of the pre-period. As long as $x^{(t)}$ was not impacted by the intervention, it is plausible that the modeled relationship persists after the intervention. The fitted model is then used to predict the counterfactual time-series of the treated group in the post-period:

$$\hat{y}^{(t)} = \alpha_1 + \beta x^{(t)} + \zeta^{(t)}, \quad t = n+1, \ldots, \tau.$$

The inferred treatment effect is the difference between the observed time-series of the treated units and the counterfactual time-series that would have been observed in the absence of the intervention:

$$\hat{\phi}^{(t)} = y^{(t)} - \hat{y}^{(t)}, \quad t = n+1, \ldots, \tau. \tag{1}$$

2.1 Predictive accuracy and statistical significance

In applications in which the counterfactual time-series is known (i.e., placebo tests), models can be evaluated in terms of the mean squared prediction error (MSPE) between the predicted and actual post-intervention time-series among control units. Specifically, I calculate

$$\mathrm{MSPE} = \frac{1}{\tau - n} \sum_{t=n+1}^{\tau} \big(\hat{\phi}^{(t)}\big)^2. \tag{2}$$
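To make this workflow concrete, below is a minimal sketch of the linear baseline in Eqs. 1-2: fit the pre-period relationship between control-unit outcomes and a treated outcome, predict the post-period counterfactual, and compute the point-wise effects and their MSPE. This is an illustration on simulated data, not the paper's code; the array names and dimensions are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: X_controls is (tau, J) control-unit outcomes,
# Y_treated is (tau,) treated-unit outcomes, n is the last pre-period index.
rng = np.random.default_rng(0)
tau, J, n = 40, 5, 30
X_controls = rng.normal(size=(tau, J))
Y_treated = X_controls @ rng.normal(size=J) + rng.normal(scale=0.1, size=tau)

# Fit the pre-period relationship (the linear analogue of the RNN models).
model = LinearRegression().fit(X_controls[:n], Y_treated[:n])

# Predict the post-period counterfactual and form the effect estimates (Eq. 1).
y_hat_post = model.predict(X_controls[n:])
phi_hat = Y_treated[n:] - y_hat_post

# MSPE over the post-period (Eq. 2); near zero here because no intervention occurred.
mspe = np.mean(phi_hat ** 2)
print(round(mspe, 4))
```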

The standard deviation $\eta$ of the MSPE distribution over all control units is used to form corresponding intervals, $[\mathrm{MSPE} - z_{\alpha/2}\,\eta,\ \mathrm{MSPE} + z_{\alpha/2}\,\eta]$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.

Eq. 2 measures the accuracy of the counterfactual predictions, and consequently the accuracy of the estimated treatment effect. However, this metric does not tell us anything about the statistical significance of estimated treatment effects. ADH (2010) propose a randomization inference approach for calculating the exact distribution of placebo effects under the null hypothesis. Following Cavallo et al. (2013), I extend this method to the case of multiple (placebo) treated units by constructing a distribution of average placebo effects following the procedure described in OA-Section 1. A single false positive rate is calculated as $\mathrm{FPR} = \mathrm{FP} / [(\tau - n) \cdot J]$, where FP is the number of false positives, defined as the number of the $(\tau - n) \cdot J$ p-values that are less than or equal to the significance level $\alpha$.

2.2 RNNs

RNNs consist of a hidden state $h^{(t)}$ and an output $y^{(t)}$ which operate on a sequence $x^{(t)}$. At each time-step $t$, RNNs input $x^{(t)}$ and pass it to the hidden state, which is updated with a function $g^{(t)}$ using the entire history of the sequence (Goodfellow, Bengio, and Courville 2016, 337):

$$h^{(t)} = g^{(t)}\big(x^{(1)}, \ldots, x^{(t)}\big) \tag{3}$$
$$h^{(t)} = f\big(h^{(t-1)}, x^{(t)}\big), \tag{4}$$

where $f(\cdot)$ is a nonlinear function that operates on all time-steps and input lengths. The updated hidden layer is used to generate a sequence of output values $o^{(t)}$ in the form of log probabilities that correspond to $x^{(t)}$. The loss function computes $\hat{y}^{(t)} = \mathrm{linear}(o^{(t)})$ and compares this value to $y^{(t)}$. In the empirical applications, I calculate the loss in terms of mean squared prediction error, $L^{(t)}_{\mathrm{MSPE}} = E\big[(y^{(t)} - \hat{y}^{(t)})^2\big]$.

2.3 Encoder-decoder networks

A special variant of RNNs suitable for handling variable-length sequential data is the encoder-decoder network (Cho et al. 2014). Encoder-decoder networks are the standard for neural machine translation (Bahdanau, Cho, and Bengio 2014; Vinyals et al. 2014) and are also widely used for predictive tasks, including speech recognition (Chorowski et al. 2015) and time-series forecasting (Zhu and Laptev 2017). Encoder-decoder networks are trained to estimate the conditional distribution of the output sequence given the past input sequence, e.g., $p(y^{(1)}, \ldots, y^{(t)} \mid x^{(1)}, \ldots, x^{(t)})$, where the input and output sequence lengths can differ.

The encoder RNN reads in $x^{(t)}$ sequentially and the hidden state of the network updates according to Eq. 3. The hidden state of the encoder is a context vector $c$ that summarizes the input sequence, which is copied over to the decoder RNN. The decoder generates a variable-length output sequence by predicting $y^{(t)}$ given the encoder hidden state and the previous element of the output sequence. Thus, the hidden state of the decoder is updated recursively by

$$h^{(t)} = f\big(h^{(t-1)}, y^{(t-1)}, c\big), \tag{5}$$

and the conditional probability of the next element of the sequence is

$$P\big(y^{(t)} \mid y^{(1)}, \ldots, y^{(t-1)}, c\big) = f\big(h^{(t)}, y^{(t-1)}, c\big).$$

Effectively, the decoder learns to generate outputs $y^{(t)}$ given the previous outputs, conditioned on the input sequence.
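As a concrete illustration of the encoder-decoder structure sketched in Eq. 5, the following is a minimal PyTorch example; it is not the paper's implementation (the architectures actually used are described in the Online Appendix). The interface, in which the encoder reads control-unit outcomes and the decoder is conditioned on lagged treated-unit outcomes, and all names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal LSTM encoder-decoder for multivariate time-series prediction."""

    def __init__(self, n_controls, n_treated, hidden_size=32):
        super().__init__()
        self.encoder = nn.LSTM(n_controls, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(n_treated, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, n_treated)

    def forward(self, x_controls, y_lagged):
        # x_controls: (batch, T_in, n_controls) -- control-unit outcomes.
        # y_lagged:   (batch, T_out, n_treated) -- previous treated outcomes fed to the decoder.
        _, state = self.encoder(x_controls)      # context summarizing the input sequence
        out, _ = self.decoder(y_lagged, state)   # decoder conditioned on the encoder state
        return self.readout(out)                 # predicted treated-unit outcomes

# Example forward pass with random data.
model = EncoderDecoder(n_controls=16, n_treated=1)
x = torch.randn(8, 20, 16)      # 8 sequences, 20 pre-period steps, 16 control units
y_prev = torch.randn(8, 5, 1)   # lagged treated outcomes over 5 prediction steps
y_hat = model(x, y_prev)        # shape: (8, 5, 1)
```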

3 Application: Mayoral elections field experiment

Panagopoulos and Green (2008a) (hereafter, "PG") randomly assigned get-out-the-vote radio ads to cities holding mayoral elections in 2005 and 2006. Random assignment occurred within matched pairs, with treated cities exposed to 50, 70, or 90 gross ratings points (GRPs) of radio advertising, and control cities exposed to 0 GRPs. The study investigates whether radio ads increase electoral competitiveness, defined as the percentage difference between the first- and second-place candidates. The authors estimate a negative effect of ad spending on vote margin, which is consistent with the hypothesis that ads increase electoral competitiveness. However, the treatment effect is only significant when estimated with a linear model that includes covariates such as past turnout.

I extend the experimental data to an observational time-series setting by assembling a dataset of city-year observations of winner margin in mayoral elections. The primary source of data is the Mayoral Elections Survey Data (Ferreira and Gyourko 2017), which are described in Ferreira and Gyourko (2009). These data include winner, runner-up, and total vote counts used to construct the winner margin variable for 6,211 elections in 842 cities. The second data source is the replication data (Hopkins 2016) for Hopkins and Pettingill's (2017) study, which consists of 341 elections in 115 cities. Finally, I include winner margin data from the 74 experimental cities observed in 2005 and 2006 (P&G 2008b).² The assembled (unbalanced) panel dataset consists of winner margins from 6,578 elections in 934 cities, spanning 1945 to 2006.

² When datasets have overlapping city-year observations of winner margin, I choose the dataset of Ferreira and Gyourko (2017) because it has the best coverage. When a city has multiple elections in a year (e.g., to fill a vacancy), I take the mean across elections.

3.1 Experimental estimates

Experimental estimates are obtained by estimating the following geo-based regression (GBR) model, proposed by Vaver and Koehler (2011) for evaluating location-targeted advertising experiments:

$$y_{s,1} = \beta_0 + \beta_1 y_{s,0} + \beta_2 \tau_s + \beta_3\, \mathrm{strata}_s + \mu_s, \tag{6}$$

where $y_{s,1}$ and $y_{s,0}$ are the log of mean vote margin during the post- and pre-intervention periods for city $s$, respectively; $\tau_s$ is the post-period level of ad spending in GRPs; and $\mu_s$ is the error term. $\mathrm{strata}_s$ is a vector of price strata dummies that control for the fact that the random assignment of ad spending was conducted within price strata. Since $\tau_s$ is randomly assigned conditional on price strata, $\hat{\beta}_2$ captures the effect of ad spending on vote margin.

Eq. 6 is estimated on balanced panels of PG experimental cities that have both pre- and post-period vote margin observations, using weighted least squares with weights $w_s = 1/y_{s,0}$ to control for heteroscedasticity due to differences in city size.³ The standard errors for $\hat{\beta}_2$ are estimated with a non-parametric bootstrap stratified by price strata.

³ Following the authors' specifications, I include a year dummy in the pooled sample regression. In contrast to the authors' specifications, I do not restrict the samples to cities with incumbents running against at least one opposing candidate, in order to preserve sample sizes.

The experimental estimates and corresponding 95% confidence intervals are presented in Panel A of Table 1. The log-level regression estimate on the pooled panel containing both 2005 and 2006 elections implies that each one-point GRP purchase increases the winner's margin by 0.05% [-0.8%, 1%]. In comparison, the pooled treatment effect estimate in PG is -0.11% [-0.26%, 0.03%].

3.2 Observational estimates

In this application, $y^{(t)}$ is a multivariate time-series that takes the value of winner margins in PG treated cities, and $x^{(t)}$ is a multivariate time-series of $J = 778$ predictors consisting of both PG control cities and other non-treated cities, with missing predictor values imputed via linear interpolation. There are 47 pre-period time-steps (1948 to 2004) and two post-period time-steps (2005 and 2006). In this application, RNNs are trained on five input-output sequence pairs that are offset by one time-step into the future, with the last pair reserved for model validation.
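To illustrate how input-output sequence pairs offset by one time-step can be constructed from a pre-period panel, a simple sketch follows; the window length, array names, and toy dimensions are hypothetical, and this is not the paper's code.

```python
import numpy as np

def make_offset_pairs(controls, treated, n_pairs=5, window=10):
    """Build (input, target) sequence pairs offset by one time-step.

    controls: (T, J) array of control-unit outcomes (model inputs).
    treated:  (T, K) array of treated-unit outcomes (prediction targets).
    Returns stacked input windows and target windows shifted one step ahead.
    """
    inputs, targets = [], []
    for start in range(n_pairs):
        inputs.append(controls[start:start + window])           # x^(t), ..., x^(t+window-1)
        targets.append(treated[start + 1:start + window + 1])   # y^(t+1), ..., y^(t+window)
    return np.stack(inputs), np.stack(targets)

# Toy example: 47 pre-period steps, 778 control series, 5 treated series.
controls = np.random.rand(47, 778)
treated = np.random.rand(47, 5)
X, Y = make_offset_pairs(controls, treated)
X_train, Y_train = X[:-1], Y[:-1]   # earlier pairs used for training
X_val, Y_val = X[-1:], Y[-1:]       # last pair reserved for validation
```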

I train both encoder-decoder networks and a baseline RNN in the form of a single Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) on the observational panel dataset. In addition to using regularization techniques such as dropout and L2 loss, I check-point each RNN model throughout its training history and use the network weights with the lowest validation error to generate counterfactual predictions.

The treatment effect estimates on observational data are calculated by Eq. 1 and presented in Panel B of Table 1. With respect to the MSPE between predicted and experimental estimates of the treatment effects, encoder-decoder networks outperform the synthetic control, which is constructed using winner margin means over the pre-period. The pooled encoder-decoder estimate implies that each one-point GRP purchase increases the winner's margin by 4% [2%, 6%], which overestimates the true experimental effect and leads to a false positive. LSTM and SCM also overestimate the true experimental effect, although the corresponding confidence intervals are wide enough that the estimates do not yield false positives. The LSTM performs particularly poorly on this dataset because the network is too shallow to take advantage of the high-dimensional set of predictors.

Table 1: Comparison of experimental and observational estimates of the causal impact of the advertising campaign on log winner margin in PG treated cities; 95% intervals for φ̂ (2005), φ̂ (2006), and φ̂ (pooled) are shown in brackets.

Panel A: Experimental
GBR: [-0.01, 0.01]; [-0.02, 0.02]; [-0.008, 0.01]

Panel B: Observational
Encoder-decoder: [0.009, 0.05]; [0.04, 0.07]; [0.02, 0.06]
LSTM: [-4.69, 7.28]; [-7.52, 9.98]; [-6.11, 8.63]
SCM: [-0.84, 0.69]; [-0.91, 0.69]; [-0.88, 0.69]

Experimental estimates are obtained by estimating Eq. 6 via weighted least squares on a balanced pre-/post-period sample restricted to experimental cities (N = {31, 17, 48}). For experimental estimates, values in brackets represent 95% bootstrap confidence intervals constructed with 1,000 stratified bootstrap samples. For observational estimates, 95% randomization confidence intervals are calculated following the procedure described in Section OA-1.1. The shaded cells indicate the model with the lowest MSPE between the estimated treatment effect on the observational data and the corresponding experimental estimate.
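A rough sketch of the check-pointing strategy described above, keeping the network weights with the lowest validation error, is given below. The model, data tensors, and hyperparameters are placeholders, and this is a generic PyTorch training loop rather than the paper's implementation; the `weight_decay` argument stands in for an L2 penalty.

```python
import copy
import torch
import torch.nn as nn

def train_with_checkpointing(model, X_train, Y_train, X_val, Y_val,
                             epochs=200, lr=1e-3, weight_decay=1e-4):
    """Train a single-input model and keep the weights with the lowest validation MSPE."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    best_val, best_state = float("inf"), None

    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), Y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), Y_val).item()
        if val_loss < best_val:                      # check-point the best weights so far
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)                # restore the lowest-validation-error weights
    return model, best_val
```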

4 Application: SCM placebo tests

In a second set of analyses, I evaluate the proposed RNN-based approach on three datasets common to the SCM literature. In each dataset, I remove the actual treated unit and evaluate predictive accuracy on the control units. The treatment effects on the control units are expected to be about zero, which would yield a low MSPE.

4.1 Basque Country

Abadie and Gardeazabal (2003) estimate the economic impact of terrorism in the Basque Country during the period of 1968 to 1997 by comparing per-capita gross domestic product (GDP) in the Basque Country against a synthetic control region without terrorism. The synthetic control is constructed using pre-period measures of illiteracy, educational attainment, investment, and means of GDP. The time series begins in 1955, which leaves only n = 14 pre-period time-steps.

Fig. OA-5 plots estimated treatment effects on control units for each model. We observe considerable variability in the SCM estimates and not as much variability in the RNNs. This is explained by the fact that SCM cannot handle multiple (placebo) treated units, and thus a separate model has to be run for each (placebo) treated unit. RNNs can handle multiple treated units and thus benefit from parameter sharing across treated units. The standard deviation from the mean MSPE, which is reported in the first column of Panel A of Table 2, indicates that SCM is comparatively more variable in terms of its predictive accuracy. The baseline LSTM yields the lowest mean MSPE.

Fig. OA-6 plots the per-period randomization p-values corresponding to treatment effects on treated and control units. The p-values corresponding to treatment effects on the actual treated unit (i.e., the Basque Country) are computed by comparing the per-period treatment effects on the Basque Country against the null distribution of average placebo effects, following the procedure outlined in Section OA-1.1.
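The placebo evaluation in this section can be summarized by the sketch below, in which each control unit is treated in turn as a placebo-treated unit, its counterfactual is predicted from the remaining controls, and the post-period MSPE is recorded. This is an illustrative reading of the design, not the paper's code; `fit_and_predict` is a placeholder that stands in for any of the models (SCM, LSTM, or encoder-decoder), and the toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_predict(X_pre, y_pre, X_post):
    """Placeholder counterfactual model: a linear fit on the pre-period."""
    return LinearRegression().fit(X_pre, y_pre).predict(X_post)

def placebo_mspe(panel, n):
    """panel: (T, J) outcomes for J control units; n: last pre-period index.

    Returns one post-period MSPE per placebo-treated control unit."""
    T, J = panel.shape
    mspes = []
    for j in range(J):
        y = panel[:, j]                                # placebo-treated unit
        X = np.delete(panel, j, axis=1)                # remaining controls as predictors
        y_hat = fit_and_predict(X[:n], y[:n], X[n:])
        mspes.append(np.mean((y[n:] - y_hat) ** 2))    # Eq. 2 for this unit
    return np.array(mspes)

# Toy example with no intervention: MSPEs should be small across control units.
rng = np.random.default_rng(1)
panel = np.cumsum(rng.normal(size=(30, 16)), axis=0)
scores = placebo_mspe(panel, n=14)
print(scores.mean(), scores.std())
```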

SCM has the comparatively lowest FPR among all models (Panel B of Table 2): the model falsely rejects the null hypothesis about a quarter of the time.⁴

⁴ Note that the p-values are not adjusted for multiple comparisons.

Table 2: Evaluation metrics on SCM placebo tests for encoder-decoder networks, LSTM, and SCM on the Basque Country, California, Elections, and West Germany datasets. Panel A compares error rates for predicting the post-period time-series of control units (mean MSPE ± one standard deviation). Panel B reports the corresponding FPRs.

4.2 California

ADH (2010) apply SCM to estimate the effects of a large-scale tobacco control program implemented in California in 1988. The study spans the period of 1970 to 2000, providing n = 19 pre-period time-steps. The synthetic control is constructed using pre-period covariates including income, beer sales, demographics, and means of the dependent variable, which is log per-capita cigarette consumption.

Fig. OA-8 shows that for RNNs, control unit treatment effects are tightly centered around zero as expected, whereas SCM control treatment effects are more dispersed. Encoder-decoder networks yield the lowest mean MSPE (± 0.004), while yielding a comparatively higher FPR.

4.3 West Germany

Lastly, ADH (2015) construct a synthetic West Germany in order to estimate the impact of the 1990 German reunification on log real per-capita GDP during a post-period that extends to 2003. The time-series begins in 1960, which leaves n = 30 pre-period time-steps. The synthetic control is constructed using pre-period means of GDP, trade, industry, schooling, and investment.

SCM and RNN-based approaches both assume the absence of spillover effects. ADH (2015) acknowledge that spillover effects are a valid concern in their study because German reunification likely had effects on GDP in the 16 OECD member countries that serve as controls. Indeed, Fig. OA-11 shows that RNNs estimate mostly positive (and increasing) treatment effects on control units, which suggests that the estimate of the treatment effect on West Germany is biased upward. Overall, the baseline LSTM yields the lowest error in terms of mean MSPE, 0.02 ± 0.01, with an FPR comparable to SCM.

5 Discussion

This paper proposes a novel alternative to SCM, which is growing in popularity in the social sciences despite its limitations, the most obvious being that the choice of specification can lead to different results and thus facilitate p-hacking. Since RNNs input only control unit outcomes, and do not rely on pre-period covariates, the proposed method offers a more principled approach than SCM. RNNs are also capable of handling multiple treated units and can learn nonconvex combinations of control units. The former attribute is useful because the model can share parameters across treated units, and thus generate more precise predictions in settings in which treated units share similar data-generating processes. The latter attribute is beneficial when the data-generating process underlying the outcome of interest depends nonlinearly on the history of its inputs.

In the first empirical application, I extend experimental data from a field experiment that investigates the effect of randomized radio ads on electoral competition to an observational time-series setting. I find that encoder-decoder networks outperform both LSTM and SCM in recovering the ground-truth experimental estimate, although their comparatively narrow confidence intervals yield a false positive. The LSTM and SCM also overestimate the true experimental estimate, but have wider confidence intervals that contain zero.

In a second application, I run placebo tests on control units from three datasets common to the SCM literature. The models are evaluated on their ability to produce low error rates on control units (i.e., to estimate treatment effects of zero) without yielding false positives. I find that either encoder-decoder networks or the LSTM outperform SCM on each of the three datasets in terms of having the lowest MSPE, with FPRs comparable to SCM.

The baseline LSTM performs well in all applications except for the mayoral elections application, and outperforms encoder-decoder networks in two of the datasets. This is likely due to the LSTM being too shallow to learn useful features in datasets with high-dimensional predictor sets, which is the case for the mayoral elections data. When applied to datasets with low-dimensional predictor sets, the LSTM can perform well, but deeper networks such as encoder-decoder networks are susceptible to overfitting. Overfitting in this case means that the networks learn dependencies on a small subset of predictors and cannot generalize well to unseen data. Overfitting occurs when training encoder-decoder networks on the Basque Country dataset, which has the lowest dimensions of the three SCM datasets (Fig. 2a). Even in this case of obvious overfitting, model check-pointing is employed so that the model with the lowest validation error is used to produce counterfactual time-series.

The results suggest that RNNs should outperform SCM in all cases as long as the complexity of the network architecture is proportional to the dimension of the predictor set. Encoder-decoder networks outperform the other models when the predictor set is comparatively large (i.e., J = 38 in the California dataset and J = 778 in the elections dataset), while the baseline LSTM outperforms all other models on smaller predictor sets (J = 16 for the Basque Country and West Germany datasets).

Future work can explore predicting counterfactual time-series using only the treated unit outcomes as model inputs. As the study of German reunification demonstrates, appropriate control units are not always available, either because they are fundamentally different from treated units or due to spillover effects.

Supplementary materials

The Online Appendix (OA) contains a description of RNN architecture and implementation, and plots of RNN training history. For each application and model, it contains plots of the observed and predicted outcomes, post-period treatment effects, and p-values. The OA can be accessed at jvpoulos.github.io/papers/rnns-causal-oa.pdf.

Funding

This work was supported by the National Science Foundation Graduate Research Fellowship [grant number DGE ]. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.

Acknowledgements

Jason Anastasopoulos suggested the idea of using a machine learning algorithm as an alternative to SCM. The paper benefited from discussions with Rafael Valle.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105 (490).

———. 2015. "Comparative Politics and the Synthetic Control Method." American Journal of Political Science 59 (2).

Abadie, Alberto, and Javier Gardeazabal. 2003. "The Economic Costs of Conflict: A Case Study of the Basque Country." The American Economic Review 93 (1).

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2017. "Matrix Completion Methods for Causal Panel Data Models." arXiv preprint.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint.

Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. 2015. "Inferring Causal Impact Using Bayesian Structural Time-series Models." The Annals of Applied Statistics 9 (1).

Cavallo, Eduardo, Sebastian Galiani, Ilan Noy, and Juan Pantano. 2013. "Catastrophic Natural Disasters and Economic Growth." Review of Economics and Statistics 95 (5).

Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation." arXiv preprint.

Chorowski, Jan K., Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. "Attention-based Models for Speech Recognition." In Advances in Neural Information Processing Systems.

Cinar, Yagmur Gizem, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, Ali Aït-Bachir, and Vadim Strijov. 2017. "Position-Based Content Attention for Time Series Forecasting with Sequence-to-Sequence RNNs." In International Conference on Neural Information Processing. Springer.

Doudchenko, Nikolay, and Guido W. Imbens. 2016. "Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis." Technical report. National Bureau of Economic Research.

El Hihi, Salah, and Yoshua Bengio. 1995. "Hierarchical Recurrent Neural Networks for Long-Term Dependencies." In Neural Information Processing Systems.

Ferman, Bruno, and Cristine Pinto. 2016. "Revisiting the Synthetic Control Estimator." Working paper.

Ferman, Bruno, Cristine Pinto, and Vitor Possebom. 2018. "Cherry Picking with Synthetic Controls." Working paper.

Ferreira, Fernando, and Joseph Gyourko. 2009. "Do Political Parties Matter? Evidence from US Cities." The Quarterly Journal of Economics 124 (1).

———. 2017. "Mayoral Elections Survey Data." Received privately from Fernando Ferreira on Oct. 25.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, MA: MIT Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. "Long Short-Term Memory." Neural Computation 9 (8).

Hopkins, Daniel. 2016. "Retrospective Voting in Big-City U.S. Mayoral Elections." dx.doi.org/ /dvn/kfikh8. Accessed Oct. 25.

Hopkins, Daniel J., and Lindsay M. Pettingill. 2017. "Retrospective Voting in Big-City US Mayoral Elections." Political Science Research and Methods.

Kaul, Ashok, Stefan Klößner, Gregor Pfeifer, and Manuel Schieler. 2015. "Synthetic Control Methods: Never Use All Pre-Intervention Outcomes Together With Covariates." Working paper.

Klößner, Stefan, Ashok Kaul, Gregor Pfeifer, and Manuel Schieler. 2017. "Comparative Politics and the Synthetic Control Method Revisited: A Note on Abadie et al. (2015)." Swiss Journal of Economics and Statistics.

Panagopoulos, Costas, and Donald P. Green. 2008a. "Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." American Journal of Political Science 52 (1).

———. 2008b. "Replication Materials for: Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." Accessed Oct. 25.

Vaver, Jon, and Jim Koehler. 2011. "Measuring Ad Effectiveness Using Geo Experiments." Technical report. Google Inc. http://googleresearch.blogspot.com/2011/12/measuring-ad-effectiveness-using-geo.html.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. "Grammar as a Foreign Language." arXiv preprint.

Zhu, Lingxue, and Nikolay Laptev. 2017. "Deep and Confident Prediction for Time Series at Uber." arXiv preprint.


More information

Implicitly-Defined Neural Networks for Sequence Labeling

Implicitly-Defined Neural Networks for Sequence Labeling Implicitly-Defined Neural Networks for Sequence Labeling Michaeel Kazi MIT Lincoln Laboratory 244 Wood St, Lexington, MA, 02420, USA michaeel.kazi@ll.mit.edu Abstract We relax the causality assumption

More information

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, 2015 @ Deep Lunch 1 Why LSTM? OJen used in many recent RNN- based systems Machine transla;on Program execu;on Can

More information

arxiv: v1 [econ.em] 18 Nov 2017

arxiv: v1 [econ.em] 18 Nov 2017 Robust Synthetic Control Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen arxiv:1711.06940v1 [econ.em] 18 Nov 2017 Laboratory for Information and Decision Systems, Statistics and Data Science Center,

More information

Diversifying Neural Conversation Model with Maximal Marginal Relevance

Diversifying Neural Conversation Model with Maximal Marginal Relevance Diversifying Neural Conversation Model with Maximal Marginal Relevance Yiping Song, 1 Zhiliang Tian, 2 Dongyan Zhao, 2 Ming Zhang, 1 Rui Yan 2 Institute of Network Computing and Information Systems, Peking

More information

arxiv: v1 [econ.em] 19 Feb 2019

arxiv: v1 [econ.em] 19 Feb 2019 Estimation and Inference for Synthetic Control Methods with Spillover Effects arxiv:1902.07343v1 [econ.em] 19 Feb 2019 Jianfei Cao Connor Dowd February 21, 2019 Abstract he synthetic control method is

More information

A Machine Learning Approach for Virtual Flow Metering and Forecasting

A Machine Learning Approach for Virtual Flow Metering and Forecasting To appear in Proc. of 3rd IFAC Workshop on Automatic Control in Offshore Oil and Gas Production, Esbjerg, Denmark, May 30 June 01, 2018. A Machine Learning Approach for Virtual Flow Metering and Forecasting

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information

Supporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil

Supporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil Supporting Information Controlling the Airwaves: Incumbency Advantage and Community Radio in Brazil Taylor C. Boas F. Daniel Hidalgo May 7, 2011 1 1 Descriptive Statistics Descriptive statistics for the

More information

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS 2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS Simone Scardapane,

More information

Review Networks for Caption Generation

Review Networks for Caption Generation Review Networks for Caption Generation Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen School of Computer Science Carnegie Mellon University {zhiliny,yey1,yuexinw,rsalakhu,wcohen}@cs.cmu.edu

More information