Lustick Consulting's ABM Verification & Validation
September 2015

The key to Lustick Consulting's Verification & Validation (V&V) process is iterative analysis of model inputs, theory, outputs, and empirical indicators. V&V should not be a static, one-time event, but rather an ongoing process of hypothesizing, testing, and confirmation or disconfirmation.[1] To implement this goal, Option I of our project called for six quarterly Verification & Validation reports from August 2012 to December 2013 that documented our system status, data inputs, and model forecasts. In addition, in May 2012 we performed a MESA Epistemological Decomposition of our Venezuela model to unpack the data and theory behind the model.[2] Lastly, we held multiple sessions with two Subject Matter Experts, Dr. Allen Hicken and Dr. David Faris, who helped us better understand how our models relate to the real world. Both experts found our approach compelling in relation to their countries of expertise (Egypt, Indonesia, the Philippines, and Thailand).

There are many other ways to measure verification and validation, including a model's consistency, prominence, and accuracy with respect to other research in the discipline. We can also measure a model's utility to users, since an effective model should be useful for some specified purpose. Of course, the classic way to validate a model in modern science is to build a ground-truth dataset and use that real-world data to test the model in- and out-of-sample. In this section, we test the validity of our country models by fitting a simple logistic regression that uses key indicators from the ABM output to predict the likelihood of five Events of Interest (EOIs) from the ICEWS project: Domestic Political Crisis, Insurgency, Rebellion, Ethnic/Religious Violence, and International Crisis.

However, forecasting discrete events, per se, is not a fully appropriate measure of performance for our modeling technique. This shortcoming stems from the greater flexibility of our model, which seeks to provide information not only about what could happen or did happen, but also about how and why those things could happen or did happen. Our model can also examine the likelihood of what could have happened, and how and why those counterfactual events might have occurred. From a causal, theoretical, and policy-planning point of view, it is crucial to understand that much of what actually happens is random, i.e., theoretically uninteresting and causally inaccessible. In other words, a model can be excellent and still be wrong about a particular forecast. The better, and more difficult, validation procedure must compare patterns of forecasts to patterns of outcomes, not merely forecast discrete events.

[1] For more information on Lustick Consulting's deep Verification & Validation work, see Lustick and Tubin (2012).
[2] For more information on the Model Evaluation, Selection, and Assessment (MESA) project, see Ruvinsky et al. (2012).
Our computational steering process for updating the models allows us to begin model runs at the start of 2001, but we treat the first three years of each run as a burn-in period, a common practice in agent-based modeling applications. We also start four of our models later than 2001 because of data limitations and severe regime shifts in some countries. In all models we gather thirteen key output variables that we believe are the most likely drivers of the ICEWS EOIs. These include:

1. Mobilization indicators: attacks, protests, lobbies, and victims
2. Dynamic Political Hierarchy (DPH) indicators: dominant, incumbent, regime, system, and non-system subscription and activation (for more information on the Dynamic Political Hierarchy, see Lustick et al., 2012)

We then fit a simple logistic regression using these variables to predict each of our five binary EOIs. All variables were included for all models, and no attempt was made to tune the model for performance. We intend these results as a proof of concept, not a set of stable LC forecasts. To gauge the accuracy of our results, we compare our in-sample and out-of-sample forecasts to the Ensemble Bayesian Model Averaging (EBMA) model from the ICEWS project (Montgomery et al., 2012). In general, we have found that our models forecast events well and rival the EBMA in some cases.

Figure 1: Brier scores by EOI and country, broken down by the ABM and EBMA results

Figure 1 above compares the Brier scores[3] of the ABM and EBMA forecasts for each EOI and country. We can also visualize classification accuracy with the separation plots by EOI in Figure 2. In a separation plot, predictions are ordered from least likely to most likely, and a red bar indicates a true EOI; an accurate model therefore shows high predictions during true events on the right side of the graph and low predictions during false events on the left. We can further check our model's accuracy by looking at Sensitivity (percentage of 1s correct), Specificity (percentage of 0s correct), and φ statistics[4] (see Table 1).

[3] https://en.wikipedia.org/wiki/Brier_score
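As a concrete illustration of the fitting and scoring procedure just described, the following is a minimal Python sketch using pandas and scikit-learn. The file name, column names, and data layout are hypothetical stand-ins rather than LC's actual pipeline; the July 2013 cutoff mirrors the out-of-sample boundary noted later in this section.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss

    # Hypothetical layout: one row per country-month, with the ABM output
    # variables as columns and one 0/1 column per ICEWS Event of Interest.
    df = pd.read_csv("abm_output_by_country_month.csv", parse_dates=["date"])

    abm_vars = ["attack", "protest", "lobby", "victim",        # mobilization
                "dominant_sub", "incumbent_sub", "regime_sub",
                "system_sub", "nonsystem_sub"]                 # illustrative subset
    eois = ["dpc", "ins", "reb", "erv", "ic"]                  # the five EOIs

    # Everything after the out-of-sample boundary is held out for scoring;
    # no tuning is done, in the proof-of-concept spirit of the text.
    train = df[df["date"] < "2013-07-01"]
    test = df[df["date"] >= "2013-07-01"]

    for eoi in eois:
        fit = LogisticRegression(max_iter=1000).fit(train[abm_vars], train[eoi])
        p_in = fit.predict_proba(train[abm_vars])[:, 1]
        p_out = fit.predict_proba(test[abm_vars])[:, 1]
        print(f"{eoi}: in-sample Brier {brier_score_loss(train[eoi], p_in):.3f}, "
              f"out-of-sample Brier {brier_score_loss(test[eoi], p_out):.3f}")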
To calculate the metrics that require a cut-point, the value .5 was used. (This cut-point was chosen ahead of time, not calibrated to improve performance.)

Figure 2: Separation plot of in-sample and out-of-sample results

Table 1: Agent-based Model Forecast Metrics, In- and Out-of-sample

EOI                          % Correct   Sensitivity   Specificity   False Pos.   False Neg.   φ (×100)
Domestic Political Crisis    82.30       46.65         95.54         4.46         53.35        51.29
Rebellion                    83.85       82.24         84.54         15.46        17.76        63.96
International Crisis         91.68       74.23         96.24         3.76         25.77        73.75
Insurgency                   79.26       48.54         96.97         3.03         51.46        55.15
Ethnic/Religious Violence    88.54       63.76         95.80         4.20         36.24        65.33

[4] The φ coefficient is a method for measuring the correlation of binary variables. https://en.wikipedia.org/wiki/Phi_coefficient
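The Table 1 metrics can be reproduced from any forecast series with a short, generic routine such as the sketch below (not LC's internal tooling). It derives the percent correct, sensitivity, specificity, both error rates, and the φ coefficient from the 2×2 confusion table at the pre-registered .5 cut-point.

    import numpy as np

    def cutpoint_metrics(y_true, p_hat, cut=0.5):
        """Confusion-table metrics at a fixed, pre-registered cut-point."""
        y = np.asarray(y_true, dtype=int)
        pred = (np.asarray(p_hat, dtype=float) >= cut).astype(int)
        tp = int(np.sum((y == 1) & (pred == 1)))
        tn = int(np.sum((y == 0) & (pred == 0)))
        fp = int(np.sum((y == 0) & (pred == 1)))
        fn = int(np.sum((y == 1) & (pred == 0)))
        sens = 100.0 * tp / (tp + fn)            # percentage of 1s correct
        spec = 100.0 * tn / (tn + fp)            # percentage of 0s correct
        pct_correct = 100.0 * (tp + tn) / len(y)
        denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
        phi = (tp * tn - fp * fn) / denom if denom else 0.0
        return {"% correct": pct_correct, "sensitivity": sens,
                "specificity": spec, "false pos.": 100.0 - spec,
                "false neg.": 100.0 - sens, "phi (x100)": 100.0 * phi}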
Figure 3: EOI forecasts for Yemen

To give an example, Figure 3 shows four EOI forecasts for our Yemen model, comparing the predictions of the EBMA and ABM models. Both models do poorly in predicting the Yemen Rebellion (REB) and Domestic Political Crisis (DPC), but do modestly well in forecasting the Insurgency (INS) and Ethnic/Religious Violence (ERV) before those events occurred. Note that forecasts made after July 2013 are completely out-of-sample.

To show the volume of data being used for this analysis, Figure 4 presents the ABM output data for each country-month combination within our timeframe. Changes in the ABM output data are caused by a combination of exogenous punctuations, continuous computational steering, and internal model dynamics. The x-axis for each country shows the month and the y-axis shows the value of each ABM output variable; the y-axis scales may vary by country. Lobby, protest, attack, and victim measure the average number of agents during each country-month that are mobilizing in different ways, representing levels of discontent and isolation that rise to the level of mobilization. The DPH subscription variables count, for each DPH level, the average number of agents for whom that level is the highest to which they subscribe. For example, high levels of system and non-system subscription tell us that many agents in the landscape do not have a dominant, incumbent, or regime identity in their repertoire and are therefore alienated from the center of politics. The last set of variables, DPH activation, is the average number of agents activated on an identity at each level of the DPH.
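To make the subscription rule concrete, the sketch below shows one way the per-level counts could be computed from agents' identity repertoires. The data structures are hypothetical simplifications of the simulation's internal bookkeeping, not the model's actual implementation.

    from collections import Counter

    # DPH levels ordered from the political center outward. An agent is
    # counted at the highest (most central) level among its subscriptions.
    DPH_LEVELS = ["dominant", "incumbent", "regime", "system", "non_system"]
    RANK = {level: i for i, level in enumerate(DPH_LEVELS)}

    def subscription_counts(agent_repertoires):
        """agent_repertoires: iterable of sets of DPH levels, one per agent.
        Each agent is counted once, at its highest subscribed level."""
        counts = Counter()
        for levels in agent_repertoires:
            if levels:
                counts[min(levels, key=RANK.get)] += 1
        return counts

    # An agent subscribed to {"regime", "system"} counts as "regime"; one
    # subscribed only to {"non_system"} is alienated from the center.
    print(subscription_counts([{"regime", "system"}, {"non_system"}]))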
High regime activation simply means that many agents are activated on some regime identity, though they may also be subscribed to dominant, incumbent, or system identities. All of these variables have consistent meanings across countries and are therefore ideal for use in a large-N cross-country analysis.

Figure 4: Main ABM output variables over time

We also ran a similar experiment as part of the ME-CEWS project, forecasting the number of violent events per province per week in seven Middle Eastern countries. We followed the same method described above, but fit a simple ordinary least squares linear model with a floor of zero, and used only a subset of the variables from the EOI country-level forecasts: attack, protest, lobby, and our DPH subscription measures (eight variables). We again compare our model to the EBMA results, but different metrics are needed to measure the accuracy of a count variable as opposed to a binary one.
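A minimal sketch of that count-forecast setup, again as generic code rather than LC's pipeline: an OLS fit whose predictions are floored at zero (negative event counts being meaningless), scored by RMSE and MAE. Because RMSE penalizes large misses more heavily than MAE, a model with a few big outlier errors can show a higher RMSE but a lower MAE.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_floored_ols(X_train, y_train, X_test):
        """OLS on province-week features; predictions floored at zero."""
        ols = LinearRegression().fit(X_train, y_train)
        return np.maximum(ols.predict(X_test), 0.0)

    def rmse(y, y_hat):
        # Squares the errors, so a few large misses dominate the score.
        return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

    def mae(y, y_hat):
        # Weights all errors linearly, so it is less sensitive to outliers.
        return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))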
In Table 2 we show the comparison between the ABM and EBMA validation metrics for weekly forecasts of violent events in 125 provinces between January 2004 and September 2015. The correlation is weak for both models, meaning that neither did particularly well in forecasting violent events. Although the ABM's Root Mean Squared Error is higher, its Mean Absolute Error is lower, indicating that the ABM's errors are concentrated in a smaller number of large outlier forecasts. Perhaps most interesting, the correlation between the two models' forecasts is only .38, which suggests that our model would likely improve the EBMA forecast if included in the averaging algorithm.

Table 2: Empirical validation for violent event forecasts

Metric                     Agent-based Model (ABM)   Ensemble Bayesian Model Averaging (EBMA)
Root Mean Squared Error    10.99                     10.26
Mean Absolute Error        2.74                      2.79
Pearson's Correlation      .33                       .46
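To see why that low .38 inter-model correlation is promising, consider a toy demonstration with synthetic data: a plain unweighted average of two equally accurate but weakly correlated forecasts beats either forecast alone. (EBMA's weighted averaging exploits the same effect more carefully; the sketch below is only an illustration of the principle.)

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.poisson(3.0, size=50_000).astype(float)

    # Two synthetic forecasts with similar accuracy but independent errors,
    # standing in for two forecasting models with weakly correlated misses.
    f1 = truth + rng.normal(0.0, 3.0, truth.shape)
    f2 = truth + rng.normal(0.0, 3.0, truth.shape)

    def rmse(y, y_hat):
        return float(np.sqrt(np.mean((y - y_hat) ** 2)))

    print("model 1 RMSE:", round(rmse(truth, f1), 2))             # ~3.0
    print("model 2 RMSE:", round(rmse(truth, f2), 2))             # ~3.0
    print("average RMSE:", round(rmse(truth, (f1 + f2) / 2), 2))  # ~2.1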
To show the richness of the output, Figure 5 shows the weekly ABM output for each Syrian province between 2012 and 2017 (our forecast end date). There is significant variation from province to province due to differences in support for factions (Assad, Kurds, ISIS, rebels), identity complexion (ethnic, religious, political), and steering data drawn from media reports picked up by the ICEWS event data.

Figure 5: ABM Output for Syrian Provinces

The main advantage of a large-N approach to problems like these is that the logistic regression can find patterns that a human could not. The downside is that all of the results are correlational, meaning that they do not necessarily describe real causal processes. Building a large set of agent-based models, running and updating them over a fifteen-year period, and then validating the results against real-world data is an unprecedented step in the field of computational social science. Moreover, showing that those results can compete with and complement the state of the art in statistical forecasting is strong evidence that ABM holds great potential for the future of the discipline.

Bibliography

Lustick, Ian S., and Matthew Tubin. 2012. "Verification as a Form of Validation: Deepening Theory to Broaden Application of DoD Protocols to the Social Sciences." Advances in Design for Cross-Cultural Activities.

Lustick, Ian S., et al. 2012. "From Theory to Simulation: The Dynamic Political Hierarchy in Country Virtualisation Models." Journal of Experimental & Theoretical Artificial Intelligence 24(3).

Montgomery, Jacob M., Florian M. Hollenbach, and Michael D. Ward. 2012. "Improving Predictions Using Ensemble Bayesian Model Averaging." Political Analysis 20(3).

Ruvinsky, Alicia I., Janet E. Wedgwood, and John J. Welsh. 2012. "Establishing Bounds of Responsible Operational Use of Social Science Models via Innovations in Verification and Validation." Advances in Design for Cross-Cultural Activities, Part II.