RNN-based counterfactual time-series prediction
Jason Poulos
May 11, 2018 (arXiv v4 [stat.ML])

Abstract

This paper proposes an alternative to the synthetic control method (SCM) for estimating the effect of a policy intervention on an outcome over time. Recurrent neural networks (RNNs) are used to predict counterfactual time-series of treated unit outcomes using only the outcomes of control units as inputs. Unlike SCM, the proposed method does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, can handle multiple treated units, and can share model parameters across time-steps. RNNs outperform SCM in terms of recovering experimental estimates from a field experiment extended to a time-series observational setting. In placebo tests run on three different benchmark datasets, RNNs are more accurate than SCM in predicting the post-intervention time-series of control units, while yielding a comparable proportion of false positives. The proposed method contributes to a new literature that uses machine learning techniques for data-driven counterfactual prediction.

Graduate Student, Department of Political Science, University of California, Berkeley. Contact: poulos@berkeley.edu.
1 Introduction

An important problem in the social sciences is estimating the effect of a policy intervention on an outcome over time. When interventions take place at an aggregate level (e.g., city or state), researchers make causal inferences by comparing the post-intervention outcomes of affected ("treated") units against the outcomes of unaffected units ("controls"). The synthetic control method (SCM) (Abadie, Diamond, and Hainmueller 2010) is a popular method for making causal inferences on observational time-series. The method compares a single treated unit outcome with a synthetic control that combines the outcomes of multiple control units on the basis of their pre-intervention similarity with the treated unit. Specifically, the synthetic control is constructed by choosing a convex combination of weights $w = (w_1, \ldots, w_J)$, with $w_j \geq 0$ and $\sum_j w_j = 1$, of control time-series that minimizes $\|X_1 - X_0 w\|_v$, where $X_1$ and $X_0$ are the pre-intervention covariates of the treated and control units, respectively, and $v = (v_1, \ldots, v_J)$ are importance weights chosen to minimize the prediction error produced by $w(v)$ during cross-validation.

The SCM has several limitations. First, the formation of the synthetic control is restricted to a convex combination of controls. Ferman and Pinto (2016) point out that this restriction implies that the SCM estimator may be biased even if selection into treatment is only correlated with time-invariant unobserved covariates. Second, the specification of the estimator can produce very different results. Ferman, Pinto, and Possebom (2018) show how cherry-picking between common SCM specifications can facilitate p-hacking. Third, SCM cannot handle multiple treated units.

This paper proposes an alternative to SCM that does not rely on pre-intervention covariates, allows for nonconvex combinations of control units, and can handle multiple treated units. The method uses recurrent neural networks (RNNs) for counterfactual time-series
1. Several problems arise from the lack of guidance on how to specify the SCM estimator. Kaul et al. (2015) show, for instance, that the common practice of including lagged versions of the outcome variable as separate predictors can render all other covariates irrelevant. Klößner et al. (2017) show that cross-validation can yield multiple values of importance weights, and consequently different results.
prediction, using only the outcomes of control units as model inputs. RNNs are a class of neural networks that take advantage of the sequential nature of time-series data by sharing model parameters across multiple time-steps (El Hihi and Bengio 1995). RNNs are capable of not only handling multiple treated units, but also sharing learned model weights when predicting the outcomes of multiple treated units. RNNs that are sufficiently deep can learn the most useful nonconvex combination of control unit outcomes at each time-step for generating counterfactual predictions. RNNs have been shown to outperform various linear models on time-series prediction tasks (Cinar et al. 2017). The proposed method builds on a new literature that uses machine learning techniques such as matrix completion (Athey et al. 2017) and Lasso (Doudchenko and Imbens 2016) for data-driven counterfactual prediction.

Results from empirical applications demonstrate that RNNs outperform SCM in terms of recovering experimental estimates from an actual field experiment extended to a time-series observational setting. RNNs are also more accurate than SCM in predicting the post-intervention time-series of control units in a series of placebo tests. In Section 2, I describe the approach of using RNNs for counterfactual time-series prediction and detail the procedure for evaluating the models in terms of predictive accuracy and statistical significance; Sections 3 and 4 evaluate the models in empirical applications; and Section 5 discusses when the proposed method is expected to outperform SCM.

2 RNNs for counterfactual prediction

The proposed method estimates the causal effect of a discrete intervention in observational time-series data; i.e., settings in which treatment is not randomly assigned and there exist both pre- and post-intervention observations of the outcome of interest. Brodersen et al.
(2015) originally propose an alternative to SCM that uses the pre-period time-series of control units to train a model to predict the counterfactual time-series of the treated unit.
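To make the contrast with SCM concrete, the convex-weight problem described in Section 1 can be sketched numerically. This is an illustrative quadratic-programming sketch with uniform importance weights v, not the SCM authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def scm_weights(X1, X0, v=None):
    """Solve for convex weights w (w_j >= 0, sum_j w_j = 1) minimizing the
    v-weighted distance ||X1 - X0 w||_v, where X1 (k,) holds the treated
    unit's pre-period covariates and X0 (k, J) holds the controls'."""
    k, J = X0.shape
    v = np.ones(k) if v is None else v   # importance weights on covariates

    def loss(w):
        resid = X1 - X0 @ w
        return resid @ (v * resid)

    w0 = np.full(J, 1.0 / J)             # start from uniform weights
    res = minimize(loss, w0,
                   bounds=[(0.0, 1.0)] * J,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x

# Toy example: the treated unit is exactly the average of two controls,
# so the optimal convex combination is w = (0.5, 0.5).
X0 = np.array([[1.0, 3.0],
               [2.0, 4.0]])
X1 = np.array([2.0, 3.0])
w = scm_weights(X1, X0)
```

With the equality constraint and bounds, `scipy.optimize.minimize` defaults to the SLSQP solver, which handles this small constrained quadratic problem directly.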
The key assumption of this approach is that the relationship between predictors $x^{(1)}, \ldots, x^{(\tau)}$ and the treated time-series $y^{(1)}, \ldots, y^{(\tau)}$ modeled prior to the intervention persists after the intervention. The approach also assumes that the control units are not themselves affected by the intervention; i.e., there is no spillover effect that would indeterminately bias the estimate and possibly lead to a false positive. In its most basic form, the pre-period relationship can be modeled by linear regression,

$$y^{(t)} = \alpha_0 + \beta x^{(t)} + \epsilon^{(t)}, \quad t = 1, \ldots, n,$$

where time-step $t = 1, \ldots, n, \ldots, \tau$ is the temporal ordering of the time-series and $n$ denotes the end of the pre-period. As long as $x^{(t)}$ was not impacted by the intervention, it is plausible that the modeled relationship persists after the intervention. The fitted model is then used to predict the counterfactual time-series of the treated group in the post-period:

$$\hat{y}^{(t)} = \alpha_1 + \beta x^{(t)} + \zeta^{(t)}, \quad t = n+1, \ldots, \tau.$$

The inferred treatment effect is the difference between the observed time-series of the treated units and the counterfactual time-series that would have been observed in the absence of the intervention:

$$\hat{\phi}^{(t)} = y^{(t)} - \hat{y}^{(t)}, \quad t = n+1, \ldots, \tau. \quad (1)$$

2.1 Predictive accuracy and statistical significance

In applications in which the counterfactual time-series is known (i.e., placebo tests), models can be evaluated in terms of the MSPE between the predicted and actual post-intervention time-series among control units. Specifically, I calculate:

$$\mathrm{MSPE} = \frac{1}{\tau - n} \sum_{t=n+1}^{\tau} \left( \hat{\phi}^{(t)} \right)^2, \quad (2)$$
and the standard deviation $\eta$ of the MSPE distribution over all control units is used to form corresponding intervals, $[\mathrm{MSPE} - z_{\alpha/2}\,\eta, \; \mathrm{MSPE} + z_{\alpha/2}\,\eta]$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.

Eq. 2 measures the accuracy of the counterfactual predictions, and consequently the accuracy of the estimated treatment effect. However, this metric does not tell us anything about the statistical significance of estimated treatment effects. ADH (2010) propose a randomization inference approach for calculating the exact distribution of placebo effects under the null hypothesis. Following Cavallo et al. (2013), I extend this method to the case of multiple (placebo) treated units by constructing a distribution of average placebo effects following the procedure described in OA-Section 1. A single false positive rate is calculated as $\mathrm{FPR} = \mathrm{FP} / ((\tau - n) J)$, where FP is the number of false positives, defined as the number of the $(\tau - n) \cdot J$ p-values less than or equal to the significance level $\alpha$.

2.2 RNNs

RNNs consist of a hidden state $h^{(t)}$ and an output $y^{(t)}$ which operate on a sequence $x^{(t)}$. At each time-step $t$, RNNs input $x^{(t)}$ and pass it to the hidden state, which is updated with a function $g^{(t)}$ using the entire history of the sequence (Goodfellow, Bengio, and Courville 2016, p. 337):

$$h^{(t)} = g^{(t)}\left( x^{(1)}, \ldots, x^{(t)} \right) \quad (3)$$
$$\phantom{h^{(t)}} = f\left( h^{(t-1)}, x^{(t)} \right), \quad (4)$$

where $f(\cdot)$ is a nonlinear function that operates on all time-steps and input lengths. The updated hidden layer is used to generate a sequence of output values $o^{(t)}$ in the form of log probabilities that correspond to $x^{(t)}$. The loss function computes $\hat{y}^{(t)} = \mathrm{linear}(o^{(t)})$ and compares this value to $y^{(t)}$. In the empirical applications, I calculate the loss in terms of
mean squared prediction error, $L^{(t)}_{\mathrm{MSPE}} = \mathbb{E}\left[ \left( y^{(t)} - \hat{y}^{(t)} \right)^2 \right]$.

2.3 Encoder-decoder networks

A special variant of RNNs suitable for handling variable-length sequential data is the encoder-decoder network (Cho et al. 2014). Encoder-decoder networks are the standard for neural machine translation (Bahdanau, Cho, and Bengio 2014; Vinyals et al. 2014) and are also widely used for predictive tasks, including speech recognition (Chorowski et al. 2015) and time-series forecasting (Zhu and Laptev 2017). Encoder-decoder networks are trained to estimate the conditional distribution of the output sequence given the past input sequence, e.g., $p(y^{(1)}, \ldots, y^{(t)} \mid x^{(1)}, \ldots, x^{(t)})$, where the input and output sequence lengths can differ. The encoder RNN reads in $x^{(t)}$ sequentially and the hidden state of the network updates according to Eq. 3. The hidden state of the encoder is a context vector $c$ that summarizes the input sequence, which is copied over to the decoder RNN. The decoder generates a variable-length output sequence by predicting $y^{(t)}$ given the encoder hidden state and the previous element of the output sequence. Thus, the hidden state of the decoder is updated recursively by

$$h^{(t)} = f\left( h^{(t-1)}, y^{(t-1)}, c \right), \quad (5)$$

and the conditional probability of the next element of the sequence is

$$P(y^{(t)} \mid y^{(1)}, \ldots, y^{(t-1)}, c) = f\left( h^{(t)}, y^{(t-1)}, c \right).$$

Effectively, the decoder learns to generate outputs $y^{(t)}$ given the previous outputs, conditioned on the input sequence.
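The recurrences in Eqs. 4 and 5 can be sketched as a toy forward pass in NumPy. Plain tanh cells stand in for the LSTM units used in the paper, and all weight matrices are illustrative random draws rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 1
Wx = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
Wh = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
Wy = rng.normal(size=(d_h, d_out))  # previous-output-to-hidden weights
Wc = rng.normal(size=(d_h, d_h))    # context-to-hidden weights
Wo = rng.normal(size=(d_out, d_h))  # hidden-to-output weights

def encode(xs):
    """Read the input sequence; h(t) = f(h(t-1), x(t)) as in Eq. 4."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h                         # context vector c summarizing the inputs

def decode(c, steps):
    """Generate outputs; h(t) = f(h(t-1), y(t-1), c) as in Eq. 5."""
    h, y, out = c.copy(), np.zeros(d_out), []
    for _ in range(steps):
        h = np.tanh(Wh @ h + Wy @ y + Wc @ c)
        y = Wo @ h                   # next output given history and context
        out.append(y)
    return np.array(out)

xs = rng.normal(size=(6, d_in))      # input sequence of length 6
preds = decode(encode(xs), steps=3)  # output sequence of length 3 != 6
```

Note how the input and output sequence lengths differ (6 vs. 3), which is the property that makes encoder-decoder networks suitable for mapping a pre-period of one length to a post-period of another.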
3 Application: Mayoral elections field experiment

Panagopoulos and Green (2008a) (hereafter, "PG") randomly assigned get-out-the-vote radio ads to cities holding mayoral elections in 2005 and 2006. Random assignment occurred within matched pairs, with treated cities exposed to 50, 70, or 90 gross ratings points (GRPs) of radio advertising, and control cities exposed to 0 GRPs. The study investigates whether radio ads increase electoral competitiveness, defined as the percentage difference between the first- and second-place candidates. The authors estimate a negative effect of ad spending on vote margin, which is consistent with the hypothesis that ads increase electoral competitiveness. However, the treatment effect is only significant when estimated with a linear model that includes covariates such as past turnout.

I extend the experimental data to an observational time-series setting by assembling a dataset of city-year observations of winner margin in mayoral elections. The primary source of data is the Mayoral Elections Survey Data (Ferreira and Gyourko 2017), which are described in Ferreira and Gyourko (2009). These data include winner, runner-up, and total vote counts used to construct the winner margin variable for 6,211 elections in 842 cities. The second data source is the replication data (Hopkins 2016) for Hopkins and Pettingill's (2017) study, which consists of 341 elections in 115 cities. Finally, I include winner margin data from the 74 experimental cities observed in 2005 and 2006 (P&G 2008b). The assembled (unbalanced) panel dataset consists of winner margins from 6,578 elections in 934 cities, spanning 1945 to 2006.

3.1 Experimental estimates

Experimental estimates are obtained by estimating the following geo-based regression (GBR) model, proposed by Vaver and Koehler (2011) for evaluating location-targeted advertising experiments:
2. When datasets have overlapping city-year observations of winner margin, I choose the dataset of Ferreira and Gyourko (2017) because it has the best coverage. When a city has multiple elections in a year (e.g., to fill a vacancy), I take the mean across elections.
$$y_{s,1} = \beta_0 + \beta_1 y_{s,0} + \beta_2 \tau_s + \beta_3 \mathrm{strata}_s + \mu_s, \quad (6)$$

where $y_{s,1}$ and $y_{s,0}$ are the log of mean vote margin during the post- and pre-intervention periods for city $s$, respectively; $\tau_s$ is the post-period level of ad spending in GRPs; and $\mu_s$ is the error term. $\mathrm{strata}_s$ is a vector of price strata dummies that control for the fact that the random assignment of ad spending was conducted within price strata. Since $\tau_s$ is randomly assigned conditional on price strata, $\hat{\beta}_2$ captures the effect of ad spending on vote margin. Eq. 6 is estimated on balanced panels of PG experimental cities that have both pre- and post-period vote margin observations, using weighted least squares with weights $w_s = 1/y_{s,0}$ to control for heteroscedasticity due to differences in city size. The standard errors for $\hat{\beta}_2$ are estimated with a non-parametric bootstrap stratified by price strata.

The experimental estimates and corresponding 95% confidence intervals are presented in Panel A of Table 1. The log-level regression estimate on the pooled panel containing both 2005 and 2006 elections implies that each one-point GRP purchase increases the winner's margin by 0.05% [-0.8%, 1%]. In comparison, the pooled treatment effect estimate in PG is -0.11% [-0.26%, 0.03%].

3.2 Observational estimates

In this application, $y^{(t)}$ is a multivariate time-series that takes the value of winner margins in PG treated cities and $x^{(t)}$ is a multivariate time-series of $J = 778$ predictors consisting of both PG control cities and other non-treated cities, with missing predictor values imputed via linear interpolation. There are 47 pre-period time-steps (1948 to 2004) and two post-period time-steps (2005 and 2006).

In this application, RNNs are trained on five input-output sequence pairs that are offset by one time-step into the future, with the last pair reserved for model validation. I train

3. Following the authors' specifications, I include a year dummy in the pooled sample regression.
In contrast to the authors' specifications, I do not restrict the samples to cities with incumbents running against at least one opposing candidate, in order to preserve sample sizes.
both encoder-decoder networks and a baseline RNN in the form of a single Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) on the observational panel dataset. In addition to using regularization techniques such as dropout and L2 loss, I check-point each RNN model throughout its training history and use the network weights with the lowest validation error to generate counterfactual predictions.

The treatment effect estimates on observational data are calculated by Eq. 1 and presented in Panel B of Table 1. With respect to the MSPE between predicted and experimental estimates of the treatment effects, encoder-decoder networks outperform the synthetic control, which is constructed using winner margin means over the pre-period. The pooled encoder-decoder estimate implies that each one-point GRP purchase increases the winner's margin by 4% [2%, 6%], which overestimates the true experimental effect and leads to a false positive. LSTM and SCM also overestimate the true experimental effect, although the corresponding confidence intervals are wide enough that the estimates do not yield false positives. The LSTM performs particularly poorly on this dataset due to the fact that the network is too shallow to take advantage of the high-dimensional set of predictors.

Model              φ̂ (2005)        φ̂ (2006)        φ̂ (pooled)
Panel A: Experimental
GBR                [-0.01, 0.01]    [-0.02, 0.02]    [-0.008, 0.01]
Panel B: Observational
Encoder-decoder    [0.009, 0.05]    [0.04, 0.07]     [0.02, 0.06]
LSTM               [-4.69, 7.28]    [-7.52, 9.98]    [-6.11, 8.63]
SCM                [-0.84, 0.69]    [-0.91, 0.69]    [-0.88, 0.69]

Table 1: Comparison of experimental and observational estimates of the causal impact of the advertising campaign on log winner margin in PG treated cities. Experimental estimates are obtained by estimating Eq. 6 via weighted least squares on a balanced pre-/post-period sample restricted to experimental cities (N = {31, 17, 48}).
For experimental estimates, values in brackets represent 95% bootstrap confidence intervals constructed with 1,000 stratified bootstrap samples. For observational estimates, 95% randomization confidence intervals are calculated following the procedure described in Section OA-1.1. The shaded cells indicate the model with the lowest MSPE between the estimated treatment effect on the observational data and the corresponding experimental estimate.
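The weighted least squares estimation of Eq. 6 can be sketched on synthetic data. The weights $w_s = 1/y_{s,0}$ and the regressors follow the text, but the sample, strata, and data-generating coefficients below are invented for illustration and are not the PG sample:

```python
import numpy as np

rng = np.random.default_rng(2)
S = 200
y0 = rng.uniform(1.0, 3.0, size=S)            # log pre-period vote margin
grp = rng.choice([0.0, 50.0, 70.0, 90.0], S)  # ad spending in GRPs
strata = rng.integers(0, 3, size=S)           # price strata (3 levels)
# Synthetic post-period outcome with a known GRP effect of 0.002:
y1 = 0.5 + 0.8 * y0 + 0.002 * grp + 0.1 * strata + 0.05 * rng.normal(size=S)

# Design matrix: intercept, pre-period outcome, GRPs, strata dummies.
X = np.column_stack([np.ones(S), y0, grp,
                     (strata == 1).astype(float),
                     (strata == 2).astype(float)])
w = 1.0 / y0                                  # WLS weights from the text
# WLS is OLS on sqrt(w)-scaled data:
Xw = X * np.sqrt(w)[:, None]
yw = y1 * np.sqrt(w)
beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
# beta_hat[2] estimates the GRP effect (true value 0.002 here)
```

Rescaling rows by $\sqrt{w_s}$ is the standard reduction of weighted least squares to ordinary least squares; a library such as statsmodels would give the same point estimates.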
4 Application: SCM placebo tests

In a second set of analyses, I evaluate the proposed RNN-based approach on three datasets common to the SCM literature. In each dataset, I remove the actual treated unit and evaluate predictive accuracy on control units. The treatment effects on the control units are expected to be about zero, which would yield a low MSPE.

4.1 Basque Country

Abadie and Gardeazabal (2003) estimate the economic impact of terrorism in the Basque Country during the period of 1968 to 1997 by comparing per-capita gross domestic product (GDP) in the Basque Country against a synthetic control region without terrorism. The synthetic control is constructed using pre-period measures of illiteracy, educational attainment, investment, and means of GDP. The time-series begins in 1955, which leaves only n = 14 pre-period time-steps.

Fig. OA-5 plots estimated treatment effects on control units for each model. We observe considerable variability in the SCM estimates and not as much variability in the RNNs. This is explained by the fact that SCM cannot handle multiple (placebo) treated units, and thus a separate model has to be run for each (placebo) treated unit. RNNs can handle multiple treated units and thus benefit from parameter sharing across treated units. The standard deviation from the mean MSPE, which is reported in the first column of Panel A of Table 2, indicates that SCM is comparatively more variable in terms of its predictive accuracy. The baseline LSTM yields the lowest mean MSPE.

Fig. OA-6 plots the per-period randomization p-values corresponding to treatment effects on treated and control units. p-values corresponding to treatment effects on the actual treated unit (i.e., Basque Country) are made by comparing the per-period treatment effects on Basque Country against the null distribution of average placebo effects following the procedure outlined in Section OA-1.1. SCM has the comparatively lowest FPR (Panel B of
Table 2) among all models: the model falsely rejects the null hypothesis about a quarter of the time.

Table 2: Evaluation metrics on SCM placebo tests.

Panel A: MSPE
Model              Basque Country   California   Elections   West Germany
Encoder-decoder    0.01 ±           ±            ±            ± 0.01
LSTM               ±                ±            ±            ± 0.01
SCM                ±                ±            ±            ± 0.11

Panel B: FPR
Model              Basque Country   California   Elections   West Germany
Encoder-decoder
LSTM
SCM

Panel A: Comparison of error rates for predicting post-period time-series for control units. Error bars represent ± one standard deviation from the MSPE. Panel B: FPRs when predicting post-period time-series for control units.

4.2 California

ADH (2010) apply SCM to estimate the effects of a large-scale tobacco control program implemented in California in 1988. The study spans the period of 1970 to 2000, providing n = 19 pre-period time-steps. The synthetic control is constructed using pre-period covariates including income, beer sales, demographics, and means of the dependent variable, which is log per-capita cigarette consumption.

Fig. OA-8 shows that for RNNs, control unit treatment effects are tightly centered around zero as expected, whereas SCM control treatment effects are more dispersed. Encoder-decoder networks yield the lowest mean MSPE (± 0.004), while yielding a comparatively higher FPR.

4. Note that the p-values are not adjusted for multiple comparisons.
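The placebo-test procedure used throughout this section can be sketched as a leave-one-out loop over control units. Plain OLS stands in for the RNN predictor, and the panel below is simulated, so this is an illustration of the evaluation design rather than of any of the three datasets:

```python
import numpy as np

def placebo_mspe(Y, n):
    """Y: (T, J) array of control-unit outcomes; n: last pre-period index.
    Each control is treated as a placebo-treated unit in turn, its post-
    period outcomes predicted from the remaining controls, and the mean
    squared placebo effect (Eq. 2) recorded."""
    T, J = Y.shape
    errs = []
    for j in range(J):
        y = Y[:, j]                          # placebo-treated unit
        X = np.delete(Y, j, axis=1)          # remaining controls as predictors
        X1 = np.column_stack([np.ones(T), X])
        coef, *_ = np.linalg.lstsq(X1[:n], y[:n], rcond=None)  # pre-period fit
        phi = y[n:] - X1[n:] @ coef          # placebo "effects", expected ~0
        errs.append(np.mean(phi ** 2))
    return np.array(errs)

rng = np.random.default_rng(4)
common = rng.normal(size=30).cumsum()        # shared trend across units
Y = common[:, None] + 0.05 * rng.normal(size=(30, 10))
errs = placebo_mspe(Y, n=20)                 # one MSPE per placebo unit
```

The mean and standard deviation of `errs` correspond to the entries and error bars reported in Panel A of Table 2.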
4.3 West Germany

Lastly, ADH (2015) construct a synthetic West Germany in order to estimate the impact of the 1990 German reunification on log real per-capita GDP during a post-period that extends to 2003. The time-series begins in 1960, which leaves n = 30 pre-period time-steps. The synthetic control is constructed using pre-period means of GDP, trade, industry, schooling, and investment.

SCM and RNN-based approaches both assume the absence of spillover effects. ADH (2015) acknowledge that spillover effects are a valid concern in their study, because German reunification likely had effects on GDP in the 16 OECD member countries that serve as controls. Indeed, Fig. OA-11 shows that RNNs estimate mostly positive (and increasing) treatment effects on control units, which suggests that the estimate of the treatment effect on West Germany is biased upward. Overall, the baseline LSTM yields the lowest error in terms of mean MSPE, 0.02 ± 0.01, with an FPR comparable to SCM.

5 Discussion

This paper proposes a novel alternative to SCM, which is growing in popularity in the social sciences despite its limitations, the most obvious being that the choice of specification can lead to different results and thus facilitate p-hacking. Since RNNs input only control unit outcomes and do not rely on pre-period covariates, the proposed method offers a more principled approach than SCM. RNNs are also capable of handling multiple treated units and can learn nonconvex combinations of control units. The former attribute is useful because the model can share parameters across treated units, and thus generate more precise predictions in settings in which treated units share similar data-generating processes. The latter attribute is beneficial when the data-generating process underlying the outcome of interest depends nonlinearly on the history of its inputs.
In the first empirical application, I extend experimental data from a field experiment that investigates the effect of randomized radio ads on electoral competition to an observational time-series setting. I find that encoder-decoder networks outperform both LSTM and SCM in recovering the ground-truth experimental estimate, although their comparatively narrow confidence intervals yield a false positive. The LSTM and SCM also overestimate the true experimental estimate, but have wider confidence intervals that contain zero. In a second application, I run placebo tests on control units from three datasets common to the SCM literature. The models are evaluated on their ability to produce low error rates on control units (i.e., estimating treatment effects of zero) without yielding false positives. I find that either encoder-decoder networks or LSTM outperform SCM on each of the three datasets in terms of having the lowest MSPE, with FPRs comparable to SCM.

The baseline LSTM performs well in all applications except for the mayoral elections application and outperforms encoder-decoder networks in two of the datasets. This is likely due to the LSTM being too shallow to learn useful features in datasets with high-dimensional predictor sets, which is the case for the mayoral elections data. When applied to datasets with low-dimensional predictor sets, the LSTM can perform well, but deeper networks such as encoder-decoder networks are susceptible to overfitting. Overfitting in this case means that the networks learn dependencies on a small subset of predictors and cannot generalize well to unseen data. Overfitting occurs when training encoder-decoder networks on the Basque Country dataset, which has the lowest dimension of the three SCM datasets (Fig. 2a). Even in this case of obvious overfitting, model check-pointing is employed so that the model with the lowest validation error is used to produce counterfactual time-series.
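The check-pointing safeguard described above can be sketched generically: track validation error after each training step and keep the weights that achieved the minimum, so predictions come from the best check-point rather than the final epoch. A linear model trained by gradient descent stands in for the RNN in this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
X_tr, y_tr = X[:80], y[:80]            # training split
X_va, y_va = X[80:], y[80:]            # validation split

w = np.zeros(5)
best_w, best_err = w.copy(), np.inf
for epoch in range(200):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.05 * grad                   # one gradient-descent step
    val_err = np.mean((X_va @ w - y_va) ** 2)
    if val_err < best_err:             # check-point the best weights so far
        best_err, best_w = val_err, w.copy()
# Counterfactual predictions would use best_w, not the final w.
```

Because `best_w` is frozen at the validation minimum, an overfitting model (validation error rising late in training) still yields predictions from its best epoch, which is the behavior the text relies on for the Basque Country dataset.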
The results suggest that RNNs should outperform SCM in all cases as long as the complexity of the network architecture is proportional to the dimension of the predictor set. Encoder-decoder networks outperform the other models when the predictor set is comparatively large (i.e., J = 38 in the California dataset and J = 778 in the elections dataset), while the baseline LSTM outperforms all other models on smaller predictor sets (J = 16 for
Basque Country and West Germany datasets). Future work can explore predicting counterfactual time-series using only the treated unit outcomes as model inputs. As the study of German reunification demonstrates, appropriate control units are not always available, either because they are fundamentally different from treated units or due to spillover effects.

Supplementary materials

The Online Appendix (OA) contains a description of RNN architecture and implementation, and plots of RNN training history. For each application and model, it contains plots of the observed and predicted outcomes, post-period treatment effects, and p-values. The OA can be accessed at jvpoulos.github.io/papers/rnns-causal-oa.pdf.

Funding

This work was supported by the National Science Foundation Graduate Research Fellowship [grant number DGE ]. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.

Acknowledgements

Jason Anastasopoulos suggested the idea of using a machine learning algorithm as an alternative to SCM. The paper benefited from discussions with Rafael Valle.
References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105 (490).

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2015. "Comparative Politics and the Synthetic Control Method." American Journal of Political Science 59 (2).

Abadie, Alberto, and Javier Gardeazabal. 2003. "The Economic Costs of Conflict: A Case Study of the Basque Country." The American Economic Review 93 (1).

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2017. "Matrix Completion Methods for Causal Panel Data Models." arXiv preprint.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint.

Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. 2015. "Inferring Causal Impact Using Bayesian Structural Time-Series Models." The Annals of Applied Statistics 9 (1).

Cavallo, Eduardo, Sebastian Galiani, Ilan Noy, and Juan Pantano. 2013. "Catastrophic Natural Disasters and Economic Growth." Review of Economics and Statistics 95 (5).

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation." arXiv preprint.

Chorowski, Jan K., Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. "Attention-Based Models for Speech Recognition." In Advances in Neural Information Processing Systems.

Cinar, Yagmur Gizem, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, Ali Aït-Bachir, and Vadim Strijov. 2017. "Position-Based Content Attention for Time Series Forecasting with Sequence-to-Sequence RNNs." In International Conference on Neural Information Processing. Springer.

Doudchenko, Nikolay, and Guido W. Imbens. 2016. "Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis." Technical report. National Bureau of Economic Research.

El Hihi, Salah, and Yoshua Bengio. 1995. "Hierarchical Recurrent Neural Networks for Long-Term Dependencies." In Neural Information Processing Systems.

Ferman, Bruno, and Cristine Pinto. 2016. "Revisiting the Synthetic Control Estimator." Working paper.

Ferman, Bruno, Cristine Pinto, and Vitor Possebom. 2018. "Cherry Picking with Synthetic Controls." Working paper.

Ferreira, Fernando, and Joseph Gyourko. 2009. "Do Political Parties Matter? Evidence from US Cities." The Quarterly Journal of Economics 124 (1).

Ferreira, Fernando, and Joseph Gyourko. 2017. Mayoral Elections Survey Data. Received privately from Fernando Ferreira on Oct. 25, 2017.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, MA: MIT Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. "Long Short-Term Memory." Neural Computation 9 (8).

Hopkins, Daniel. 2016. "Retrospective Voting in Big-City U.S. Mayoral Elections." Replication data, dx.doi.org (DVN/KFIKH8). Accessed Oct. 25, 2016.

Hopkins, Daniel J., and Lindsay M. Pettingill. 2017. "Retrospective Voting in Big-City US Mayoral Elections." Political Science Research and Methods.

Kaul, Ashok, Stefan Klößner, Gregor Pfeifer, and Manuel Schieler. 2015. "Synthetic Control Methods: Never Use All Pre-Intervention Outcomes Together With Covariates." Working paper.

Klößner, Stefan, Ashok Kaul, Gregor Pfeifer, and Manuel Schieler. 2017. "Comparative Politics and the Synthetic Control Method Revisited: A Note on Abadie et al. (2015)." Swiss Journal of Economics and Statistics.

Panagopoulos, Costas, and Donald P. Green. 2008a. "Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." American Journal of Political Science 52 (1).

Panagopoulos, Costas, and Donald P. Green. 2008b. Replication Materials for: "Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition." Accessed Oct. 25, 2017.

Vaver, Jon, and Jim Koehler. 2011. "Measuring Ad Effectiveness Using Geo Experiments." Technical report. Google Inc. http://googleresearch.blogspot.com/2011/12/measuring-ad-effectiveness-using-geo.html.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. "Grammar as a Foreign Language." arXiv preprint.

Zhu, Lingxue, and Nikolay Laptev. 2017. "Deep and Confident Prediction for Time Series at Uber." arXiv preprint.
Deep learning for Natural Language Processing and Machine Translation 2015.10.16 Seung-Hoon Na Contents Introduction: Neural network, deep learning Deep learning for Natural language processing Neural
More informationA QUESTION ANSWERING SYSTEM USING ENCODER-DECODER, SEQUENCE-TO-SEQUENCE, RECURRENT NEURAL NETWORKS. A Project. Presented to
A QUESTION ANSWERING SYSTEM USING ENCODER-DECODER, SEQUENCE-TO-SEQUENCE, RECURRENT NEURAL NETWORKS A Project Presented to The Faculty of the Department of Computer Science San José State University In
More informationRevisiting the Synthetic Control Estimator
Revisiting the Synthetic Control Estimator Bruno Ferman Cristine Pinto Sao Paulo School of Economics - FGV First Draft: June, 2016 This Draft: August, 2017 Please click here for the most recent version
More informationWhen Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?
When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University
More informationCombining Static and Dynamic Information for Clinical Event Prediction
Combining Static and Dynamic Information for Clinical Event Prediction Cristóbal Esteban 1, Antonio Artés 2, Yinchong Yang 1, Oliver Staeck 3, Enrique Baca-García 4 and Volker Tresp 1 1 Siemens AG and
More informationIntroduction to Deep Neural Networks
Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic
More informationFaster Training of Very Deep Networks Via p-norm Gates
Faster Training of Very Deep Networks Via p-norm Gates Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh Center for Pattern Recognition and Data Analytics Deakin University, Geelong Australia Email:
More informationEE-559 Deep learning LSTM and GRU
EE-559 Deep learning 11.2. LSTM and GRU François Fleuret https://fleuret.org/ee559/ Mon Feb 18 13:33:24 UTC 2019 ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE The Long-Short Term Memory unit (LSTM) by Hochreiter
More informationRevisiting the Synthetic Control Estimator
Revisiting the Synthetic Control Estimator Bruno Ferman Cristine Pinto Sao Paulo School of Economics - FGV First Draft: June, 26 This Draft: October, 27 Please click here for the most recent version Abstract
More informationCKY-based Convolutional Attention for Neural Machine Translation
CKY-based Convolutional Attention for Neural Machine Translation Taiki Watanabe and Akihiro Tamura and Takashi Ninomiya Ehime University 3 Bunkyo-cho, Matsuyama, Ehime, JAPAN {t_watanabe@ai.cs, tamura@cs,
More informationRevisiting the Synthetic Control Estimator
MPRA Munich Personal RePEc Archive Revisiting the Synthetic Control Estimator Bruno Ferman and Cristine Pinto Sao Paulo School of Economics - FGV 23 November 2016 Online at https://mpra.ub.uni-muenchen.de/77930/
More informationa) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM
c 1. (Natural Language Processing; NLP) (Deep Learning) RGB IBM 135 8511 5 6 52 yutat@jp.ibm.com a) b) 2. 1 0 2 1 Bag of words White House 2 [1] 2015 4 Copyright c by ORSJ. Unauthorized reproduction of
More informationFlexible Estimation of Treatment Effect Parameters
Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both
More informationMachine Learning Linear Regression. Prof. Matteo Matteucci
Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationRecurrent Neural Networks (Part - 2) Sumit Chopra Facebook
Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech
More informationStatistical Models for Causal Analysis
Statistical Models for Causal Analysis Teppei Yamamoto Keio University Introduction to Causal Inference Spring 2016 Three Modes of Statistical Inference 1. Descriptive Inference: summarizing and exploring
More informationCS230: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention
CS23: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention Today s outline We will learn how to: I. Word Vector Representation i. Training - Generalize results with word vectors -
More informationWhen Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data?
When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University Joint
More informationarxiv: v3 [cs.lg] 14 Jan 2018
A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation Gang Chen Department of Computer Science and Engineering, SUNY at Buffalo arxiv:1610.02583v3 [cs.lg] 14 Jan 2018 1 abstract We describe
More informationDeep Gate Recurrent Neural Network
JMLR: Workshop and Conference Proceedings 63:350 365, 2016 ACML 2016 Deep Gate Recurrent Neural Network Yuan Gao University of Helsinki Dorota Glowacka University of Helsinki gaoyuankidult@gmail.com glowacka@cs.helsinki.fi
More informationCherry Picking with Synthetic Controls
1 Working Paper 420 Junho de 2016 Cherry Picking with Synthetic Controls Bruno Ferman Cristine Pinto Vitor Possebom Os artigos dos Textos para Discussão da Escola de Economia de São Paulo da Fundação Getulio
More informationBig Data, Machine Learning, and Causal Inference
Big Data, Machine Learning, and Causal Inference I C T 4 E v a l I n t e r n a t i o n a l C o n f e r e n c e I F A D H e a d q u a r t e r s, R o m e, I t a l y P a u l J a s p e r p a u l. j a s p e
More informationarxiv: v1 [cs.ne] 19 Dec 2016
A RECURRENT NEURAL NETWORK WITHOUT CHAOS Thomas Laurent Department of Mathematics Loyola Marymount University Los Angeles, CA 90045, USA tlaurent@lmu.edu James von Brecht Department of Mathematics California
More informationInter-document Contextual Language Model
Inter-document Contextual Language Model Quan Hung Tran and Ingrid Zukerman and Gholamreza Haffari Faculty of Information Technology Monash University, Australia hung.tran,ingrid.zukerman,gholamreza.haffari@monash.edu
More informationDeep Learning. Recurrent Neural Network (RNNs) Ali Ghodsi. October 23, Slides are partially based on Book in preparation, Deep Learning
Recurrent Neural Network (RNNs) University of Waterloo October 23, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Sequential data Recurrent neural
More informationCausal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies
Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed
More informationGated Feedback Recurrent Neural Networks
Junyoung Chung Caglar Gulcehre Kyunghyun Cho Yoshua Bengio Dept. IRO, Université de Montréal, CIFAR Senior Fellow JUNYOUNG.CHUNG@UMONTREAL.CA CAGLAR.GULCEHRE@UMONTREAL.CA KYUNGHYUN.CHO@UMONTREAL.CA FIND-ME@THE.WEB
More informationIntroduction to RNNs!
Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed Outline Why Recurrent Neural Networks (RNNs)? The Vanilla RNN unit The RNN forward pass Backpropagation refresher The RNN
More informationPSIque: Next Sequence Prediction of Satellite Images using a Convolutional Sequence-to-Sequence Network
PSIque: Next Sequence Prediction of Satellite Images using a Convolutional Sequence-to-Sequence Network Seungkyun Hong,1,2 Seongchan Kim 2 Minsu Joh 1,2 Sa-kwang Song,1,2 1 Korea University of Science
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationarxiv: v2 [cs.cl] 1 Jan 2019
Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2
More informationGov 2002: 9. Differences in Differences
Gov 2002: 9. Differences in Differences Matthew Blackwell October 30, 2015 1 / 40 1. Basic differences-in-differences model 2. Conditional DID 3. Standard error issues 4. Other DID approaches 2 / 40 Where
More informationWhen Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?
When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Princeton University Asian Political Methodology Conference University of Sydney Joint
More informationAnalysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function
Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Austin Wang Adviser: Xiuyuan Cheng May 4, 2017 1 Abstract This study analyzes how simple recurrent neural
More informationRecurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta
Recurrent Autoregressive Networks for Online Multi-Object Tracking Presented By: Ishan Gupta Outline Multi Object Tracking Recurrent Autoregressive Networks (RANs) RANs for Online Tracking Other State
More informationShort-term water demand forecast based on deep neural network ABSTRACT
Short-term water demand forecast based on deep neural network Guancheng Guo 1, Shuming Liu 2 1,2 School of Environment, Tsinghua University, 100084, Beijing, China 2 shumingliu@tsinghua.edu.cn ABSTRACT
More informationStatistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods
Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods Kathleen T. Li Marketing Department The Wharton School, the University of Pennsylvania March 2018 Abstract The
More informationSeq2Tree: A Tree-Structured Extension of LSTM Network
Seq2Tree: A Tree-Structured Extension of LSTM Network Weicheng Ma Computer Science Department, Boston University 111 Cummington Mall, Boston, MA wcma@bu.edu Kai Cao Cambia Health Solutions kai.cao@cambiahealth.com
More informationDeep Counterfactual Networks with Propensity-Dropout
with Propensity-Dropout Ahmed M. Alaa, 1 Michael Weisz, 2 Mihaela van der Schaar 1 2 3 Abstract We propose a novel approach for inferring the individualized causal effects of a treatment (intervention)
More informationConvolutional Neural Networks II. Slides from Dr. Vlad Morariu
Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate
More informationESTIMATION OF TREATMENT EFFECTS VIA MATCHING
ESTIMATION OF TREATMENT EFFECTS VIA MATCHING AAEC 56 INSTRUCTOR: KLAUS MOELTNER Textbooks: R scripts: Wooldridge (00), Ch.; Greene (0), Ch.9; Angrist and Pischke (00), Ch. 3 mod5s3 General Approach The
More informationGov 2002: 3. Randomization Inference
Gov 2002: 3. Randomization Inference Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last week: This week: What can we identify using randomization? Estimators were justified via
More informationTrajectory-based Radical Analysis Network for Online Handwritten Chinese Character Recognition
Trajectory-based Radical Analysis Network for Online Handwritten Chinese Character Recognition Jianshu Zhang, Yixing Zhu, Jun Du and Lirong Dai National Engineering Laboratory for Speech and Language Information
More informationarxiv: v1 [cs.lg] 7 Apr 2017
A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction Yao Qin, Dongjin Song 2, Haifeng Cheng 2, Wei Cheng 2, Guofei Jiang 2, and Garrison Cottrell University of California, San
More informationarxiv: v1 [cs.lg] 2 Feb 2018
Short-term Memory of Deep RNN Claudio Gallicchio arxiv:1802.00748v1 [cs.lg] 2 Feb 2018 Department of Computer Science, University of Pisa Largo Bruno Pontecorvo 3-56127 Pisa, Italy Abstract. The extension
More informationLearning Chaotic Dynamics using Tensor Recurrent Neural Networks
Rose Yu 1 * Stephan Zheng 2 * Yan Liu 1 Abstract We present Tensor-RNN, a novel RNN architecture for multivariate forecasting in chaotic dynamical systems. Our proposed architecture captures highly nonlinear
More informationMatrix Completion Methods. for Causal Panel Data Models
Matrix Completion Methods for Causal Panel Data Models Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, & Khashayar Khosravi (Stanford University) NBER Summer Institute Cambridge, July 27th,
More informationIntroduction to Deep Learning CMPT 733. Steven Bergner
Introduction to Deep Learning CMPT 733 Steven Bergner Overview Renaissance of artificial neural networks Representation learning vs feature engineering Background Linear Algebra, Optimization Regularization
More informationSelection on Observables: Propensity Score Matching.
Selection on Observables: Propensity Score Matching. Department of Economics and Management Irene Brunetti ireneb@ec.unipi.it 24/10/2017 I. Brunetti Labour Economics in an European Perspective 24/10/2017
More informationEmpirical approaches in public economics
Empirical approaches in public economics ECON4624 Empirical Public Economics Fall 2016 Gaute Torsvik Outline for today The canonical problem Basic concepts of causal inference Randomized experiments Non-experimental
More informationarxiv: v1 [cs.lg] 28 Dec 2017
PixelSNAIL: An Improved Autoregressive Generative Model arxiv:1712.09763v1 [cs.lg] 28 Dec 2017 Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, Pieter Abbeel Embodied Intelligence UC Berkeley, Department of
More informationInference for Factor Model Based Average Treatment
Inference for Factor Model Based Average Treatment Effects Kathleen T. Li Marketing Department The Wharton School, the University of Pennsylvania January, 208 Abstract In this paper we consider the problem
More informationDeep Learning Architecture for Univariate Time Series Forecasting
CS229,Technical Report, 2014 Deep Learning Architecture for Univariate Time Series Forecasting Dmitry Vengertsev 1 Abstract This paper studies the problem of applying machine learning with deep architecture
More informationEdinburgh Research Explorer
Edinburgh Research Explorer Nematus: a Toolkit for Neural Machine Translation Citation for published version: Sennrich R Firat O Cho K Birch-Mayne A Haddow B Hitschler J Junczys-Dowmunt M Läubli S Miceli
More informationRecurrent Neural Networks with Flexible Gates using Kernel Activation Functions
2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,
More informationarxiv: v1 [cs.cl] 22 Jun 2015
Neural Transformation Machine: A New Architecture for Sequence-to-Sequence Learning arxiv:1506.06442v1 [cs.cl] 22 Jun 2015 Fandong Meng 1 Zhengdong Lu 2 Zhaopeng Tu 2 Hang Li 2 and Qun Liu 1 1 Institute
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationarxiv: v1 [stat.ml] 18 Nov 2017
MinimalRNN: Toward More Interpretable and Trainable Recurrent Neural Networks arxiv:1711.06788v1 [stat.ml] 18 Nov 2017 Minmin Chen Google Mountain view, CA 94043 minminc@google.com Abstract We introduce
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services
More informationA Bayesian Perspective on Residential Demand Response Using Smart Meter Data
A Bayesian Perspective on Residential Demand Response Using Smart Meter Data Datong-Paul Zhou, Maximilian Balandat, and Claire Tomlin University of California, Berkeley [datong.zhou, balandat, tomlin]@eecs.berkeley.edu
More informationDeep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng
Deep Sequence Models Context Representation, Regularization, and Application to Language Adji Bousso Dieng All Data Are Born Sequential Time underlies many interesting human behaviors. Elman, 1990. Why
More informationIndependent and conditionally independent counterfactual distributions
Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views
More informationOn the use of Long-Short Term Memory neural networks for time series prediction
On the use of Long-Short Term Memory neural networks for time series prediction Pilar Gómez-Gil National Institute of Astrophysics, Optics and Electronics ccc.inaoep.mx/~pgomez In collaboration with: J.
More informationApplied Microeconometrics (L5): Panel Data-Basics
Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics
More informationDynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji
Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient
More informationabstract sergio firpo and vítor possebom april 2 nd, 2016
SY N T H E T I C C O N T R O L E S T I M AT O R : A G E N E R A L I Z E D I N F E R E N C E P R O C E D U R E A N D C O N F I D E N C E S E T S * sergio firpo and vítor possebom april 2 nd, 2016 abstract
More informationRobust Synthetic Control
Journal of Machine Learning Research 19 (2018) 1-51 Submitted 12/17; Revised 5/18; Published 8/18 Robust Synthetic Control Muhammad Amjad Operations Research Center Massachusetts Institute of Technology
More informationRegression Discontinuity Designs
Regression Discontinuity Designs Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Regression Discontinuity Design Stat186/Gov2002 Fall 2018 1 / 1 Observational
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationKey-value Attention Mechanism for Neural Machine Translation
Key-value Attention Mechanism for Neural Machine Translation Hideya Mino 1,2 Masao Utiyama 3 Eiichiro Sumita 3 Takenobu Tokunaga 2 1 NHK Science & Technology Research Laboratories 2 Tokyo Institute of
More informationLong-term Forecasting using Tensor-Train RNNs
Long-term Forecasting using Tensor-Train RNNs Rose Yu Stephan Zheng Anima Anandkumar Yisong Yue Department of Computing and Mathematical Sciences Caltech, Pasadena, CA {rose, stephan, anima, yyue}@caltech.edu
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationRecurrent Residual Learning for Sequence Classification
Recurrent Residual Learning for Sequence Classification Yiren Wang University of Illinois at Urbana-Champaign yiren@illinois.edu Fei ian Microsoft Research fetia@microsoft.com Abstract In this paper, we
More information8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7]
8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] While cross-validation allows one to find the weight penalty parameters which would give the model good generalization capability, the separation of
More informationImplicitly-Defined Neural Networks for Sequence Labeling
Implicitly-Defined Neural Networks for Sequence Labeling Michaeel Kazi MIT Lincoln Laboratory 244 Wood St, Lexington, MA, 02420, USA michaeel.kazi@ll.mit.edu Abstract We relax the causality assumption
More informationLong Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch
Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, 2015 @ Deep Lunch 1 Why LSTM? OJen used in many recent RNN- based systems Machine transla;on Program execu;on Can
More informationarxiv: v1 [econ.em] 18 Nov 2017
Robust Synthetic Control Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen arxiv:1711.06940v1 [econ.em] 18 Nov 2017 Laboratory for Information and Decision Systems, Statistics and Data Science Center,
More informationDiversifying Neural Conversation Model with Maximal Marginal Relevance
Diversifying Neural Conversation Model with Maximal Marginal Relevance Yiping Song, 1 Zhiliang Tian, 2 Dongyan Zhao, 2 Ming Zhang, 1 Rui Yan 2 Institute of Network Computing and Information Systems, Peking
More informationarxiv: v1 [econ.em] 19 Feb 2019
Estimation and Inference for Synthetic Control Methods with Spillover Effects arxiv:1902.07343v1 [econ.em] 19 Feb 2019 Jianfei Cao Connor Dowd February 21, 2019 Abstract he synthetic control method is
More informationA Machine Learning Approach for Virtual Flow Metering and Forecasting
To appear in Proc. of 3rd IFAC Workshop on Automatic Control in Offshore Oil and Gas Production, Esbjerg, Denmark, May 30 June 01, 2018. A Machine Learning Approach for Virtual Flow Metering and Forecasting
More informationLong-Short Term Memory and Other Gated RNNs
Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling
More informationIMPROVING STOCHASTIC GRADIENT DESCENT
IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu
More informationSupporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil
Supporting Information Controlling the Airwaves: Incumbency Advantage and Community Radio in Brazil Taylor C. Boas F. Daniel Hidalgo May 7, 2011 1 1 Descriptive Statistics Descriptive statistics for the
More informationRECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS
2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS Simone Scardapane,
More informationReview Networks for Caption Generation
Review Networks for Caption Generation Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen School of Computer Science Carnegie Mellon University {zhiliny,yey1,yuexinw,rsalakhu,wcohen}@cs.cmu.edu
More information