Partially Identified Treatment Effects for Generalizability

Wendy Chan
Human Development and Quantitative Methods Division, Graduate School of Education, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
(Dated: June 24, 2016)

Abstract

Recent methods to improve generalizations from nonrandom samples typically invoke assumptions, such as strong ignorability of sample selection, that are often controversial in practice in order to derive point estimates. Rather than focus on point-estimate-based inferences, this article considers inferences on partially identified estimates derived from fewer and weaker assumptions. We extend partial identification methods to causal generalization with nonrandom samples, using a cluster randomized trial in education as an illustration. Bounds on the population average treatment effect are derived under four cases: two that impose no assumptions on the data, and two that assume bounded sample variation and monotonicity of response. This approach is amenable to incorporating population data frames to tighten the bounds on the population average treatment effect. Under the assumptions of bounded sample variation and monotonicity, the interval estimates of the average treatment effect are sufficiently informative to rule out large treatment effects and are consistent with the point estimates from the experimental study. This illustrates that partial identification methods can provide an alternative perspective on causal generalization in the absence of strong ignorability of sample selection.

Policymakers have become increasingly interested in the extent to which inferences from experimental studies apply to target populations of inference. However, in social science research, a common challenge to the generalizability of experimental results is that the samples used in experimental studies are typically not randomly selected (Greenberg & Shroder, 2004; Olsen et al., 2013). Without equal probability sampling, generalization of treatment effects is difficult because the bias induced by self-selection no longer allows for model-free estimation of treatment effects (Keiding and Louis, 2016). Recently, statisticians have developed methods to improve generalizations from non-probability samples, primarily by using propensity scores (Stuart et al., 2011; Tipton, 2012; O'Muircheartaigh and Hedges, 2014). Propensity score methods match experimental samples to an inference population on observable characteristics so that the two groups are compositionally similar (Rosenbaum and Rubin, 1984). These methods also extend the adjustments used for non-response in survey sampling. Such post-hoc adjustments facilitate bias-reduced estimates of the average treatment effect. While propensity score methods have contributed significantly to causal generalization, the assumptions they require are often strong and even controversial. In particular, strong ignorability of sample selection imposes restrictions on the distribution of the outcome variable that may not hold in practice. When those restrictions fail, the resulting inferences have limited applicability because they lack a credible connection to the empirical problem. More generally, the credibility of inferences from experimental studies decreases as the strength of the maintained assumptions increases (Manski, 1990). Furthermore, different sets of assumptions may lead to conflicting conclusions, presenting a challenge for replication.
In causal generalization, strong ignorability of sample selection is needed for point identification of average treatment effects. This paper considers the inferences that can be made

with fewer and weaker assumptions when the treatment effects are partially identified. These interval estimates, rather than point estimates, serve as an intermediary between the empirical evidence (the data itself) and point estimation. The goal of partial identification methods is to reach more credible and applicable conclusions by invoking assumptions that are more plausible in practice (Manski, 2009). In this paper, we apply partial identification methods to the problem of causal generalization in the absence of equal probability sampling. We situate our problem around a completed cluster randomized trial (CRT) in education that exhibits characteristics typical of randomized experiments in the social sciences, such as a small experimental sample relative to the population. We choose this CRT in particular because the preliminary analysis of its results was completed in Konstantopoulos et al. (2013) and the generalizability of point estimates of the treatment effect was considered in Tipton et al. (2016). Three important questions are of interest. First, are interval estimates of average treatment effects informative when few to no assumptions are made on the experimental data? This question is addressed by examining bounds on the estimated treatment effects under two cases using known features of the experimental outcomes and design. The first case makes no assumptions on the empirical evidence, and the second examines interval estimation when treatment is randomized. Second, can supplemental sources of data improve the interval estimates of treatment effects in causal generalization? This question is addressed by considering differences in the interval estimates of treatment effects when auxiliary information is used to contribute identifying power. Third, what inferences can be made when assumptions that are weaker than strong ignorability, such as bounded sample variation and monotonicity, are imposed?
To investigate this issue, we consider bounds when the two assumptions are made on the

experimental outcomes and compare the interval estimates to the ones under the no-assumptions framework. Additionally, a fusion approach combining the bounding methods with propensity score subclassification is used to analyze differences in bounds when the population is stratified into matched subclasses. Lastly, we apply the interval estimation framework to the empirical example and examine differences in estimates under varying levels of assumptions. This analysis examines the role that assumptions play in narrowing interval estimates and compares the interval estimates with the point estimates derived under strong ignorability of sample selection. Through this comparison, we bring attention to partial identification as an alternative perspective on causal generalization and argue that inferences based on a range of values can still be informative to researchers and policymakers. The goal of this paper is to highlight the issues that arise when generalization inferences rely on strong ignorability and to consider what can be said in the absence of this assumption.

CRT Example

In 2006, the Indiana Department of Education and the Indiana State Board of Education managed the implementation of a new assessment system to measure annual student growth and to provide feedback to teachers (Konstantopoulos et al., 2013). During the 2009-2010 academic year, 56 K-8 (elementary to middle) schools from the state of Indiana volunteered to implement the new system, of which 34 were randomly assigned to implement the state's assessment system while 22 served as control schools. In the treatment schools, students were given four diagnostic assessments aligned with the Indiana state test, and their teachers received online reports on student performance to dynamically guide their instruction in the periods leading up to the state exam. The effectiveness of the assessment system was measured using the Indiana

Statewide Testing for Educational Progress-Plus (ISTEP+) scores in English Language Arts (ELA) and mathematics. For each study school, the ISTEP+ scores were discretized using the minimum cutoff scores from the Indiana Department of Education and aggregated as either Pass or Not Pass. A natural question from this study is: if every school in Indiana were to implement this system, what is the expected impact on student achievement? In other words, to what extent do the results from the Indiana CRT generalize to the entire state? If the principal investigators of this study had planned with generalization in mind, treatment randomization and random sampling would have been implemented to facilitate causal generalization by direct computation of estimates of the population average treatment effect (PATE) from the data. However, in this situation, whether this convenience sample is representative of the population is questionable (Kruskal and Mosteller, 1980). Nonrandom samples introduce a complication when there are potential systematic differences between the self-selected schools and the non-sampled schools in the population, which requires assumption-based methods to produce bias-reduced estimates of the PATE.

Notation and Assumptions

In this section, we formally conceptualize the PATE in causal generalization and introduce the notation that is used throughout this article. The estimation of causal treatment effects is framed using Rubin's Causal Model (Rubin, 1974, 1977, 1980, 1986). Let P denote the population of inference consisting of N schools, of which n schools are selected into the sample. Let W be an indicator of treatment assignment where Wi = 1 if school i was assigned to implement the assessment system (treatment) and Wi = 0 if school i was not assigned to implement the system (control). For each school in P, let Yi(W) denote the binary potential

outcome of school i, indicating whether school i received a Pass score or not under the respective treatment condition (W = 0, 1). Finally, let Z be an indicator of sample selection where Zi = 1 if school i was selected into the experimental study and Zi = 0 if not selected. The treatment effect for school i is defined as τi = Yi(1) - Yi(0), and because P consists of schools that were selected and not selected into the experimental study, we can define two average treatment effects:

τ_SATE = E(τ | Z=1)
τ_PATE = E(τ | Z=1) * P(Z=1) + E(τ | Z=0) * (1 - P(Z=1))

where τ_SATE is the expected sample average treatment effect (SATE) and τ_PATE is the expected PATE. Using the law of iterated expectations, the expected values of the potential outcomes for the SATE are decomposed into the following:

E(Y(1)) = E(Y(1) | W=1) * P(W=1) + E(Y(1) | W=0) * P(W=0)   (1)
E(Y(0)) = E(Y(0) | W=1) * P(W=1) + E(Y(0) | W=0) * P(W=0)   (2)

The PATE is a function of both W and Z, so the decomposition for this estimator is given by

E(Y(1)) = E(Y(1) | W=1, Z=1) * P(W=1, Z=1) + E(Y(1) | W=0, Z=1) * P(W=0, Z=1) + E(Y(1) | W=1, Z=0) * P(W=1, Z=0) + E(Y(1) | W=0, Z=0) * P(W=0, Z=0)   (3)
E(Y(0)) = E(Y(0) | W=1, Z=1) * P(W=1, Z=1) + E(Y(0) | W=0, Z=1) * P(W=0, Z=1) + E(Y(0) | W=1, Z=0) * P(W=1, Z=0) + E(Y(0) | W=0, Z=0) * P(W=0, Z=0)   (4)

Since each study school is assigned to at most one treatment, the quantities in Equations (1)-(4) cannot be estimated directly, a premise of the Fundamental Problem of Causal Inference (Holland, 1986). For the SATE, the potential outcomes E(Y(1) | W=0) and E(Y(0) | W=1) are unobservable counterfactuals since they refer to the expected outcomes under treatment (control) when assigned control (treatment), which are unknown (Greenland et al., 1999; Dawid, 2000). Similarly, the PATE cannot be estimated as it is also a function of unobservable counterfactuals.
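As a quick numerical illustration of decompositions (3) and (4), the following sketch uses a hypothetical simulated population (not the Indiana data; all sizes and probabilities are invented for demonstration) and checks that summing the cell-specific means weighted by the cell probabilities recovers E(Y(1)), E(Y(0)), and hence the PATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of N schools with binary potential outcomes.
N = 2000
y1 = rng.binomial(1, 0.6, N)  # Y(1): Pass under treatment
y0 = rng.binomial(1, 0.4, N)  # Y(0): Pass under control
z = rng.binomial(1, 0.3, N)   # Z: sample selection indicator
w = rng.binomial(1, 0.5, N)   # W: treatment assignment indicator

pate = np.mean(y1 - y0)  # population average treatment effect

# Decompositions (3) and (4): sum cell means weighted by cell probabilities
# over the four (W, Z) cells.
def decompose(y):
    total = 0.0
    for wi in (0, 1):
        for zi in (0, 1):
            cell = (w == wi) & (z == zi)
            total += y[cell].mean() * cell.mean()
    return total

ey1, ey0 = decompose(y1), decompose(y0)
# The law of iterated expectations recovers the marginal means exactly.
assert abs((ey1 - ey0) - pate) < 1e-12
```

Of course, this check is possible only because the simulation observes both potential outcomes for every school; in practice the counterfactual cells in (3) and (4) are exactly what cannot be estimated.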

It is important to note that the decomposition of the potential outcomes for the PATE differs from that of the SATE in two important respects. First, the potential outcomes in (3) and (4) necessarily include additional counterfactual terms because the PATE requires information about the sample selection status of the units (denoted by the indicator Z). Second, the PATE incorporates two additional counterfactuals, E(Y(w) | W=w, Z=0) and E(Y(w) | W=w', Z=0) for w, w' ∈ {0,1} and w ≠ w', which will be referred to as the sample counterfactuals. These are the potential outcomes under a treatment condition for units not selected into the experimental study. E(Y(1) | W=0, Z=1) and E(Y(0) | W=1, Z=1), which we will refer to as the treatment counterfactuals, coincide with the unobservable counterfactuals for the SATE in Equations (1) and (2). Note that these two types of counterfactuals are unobservable for different reasons. The goal of any causal inference study is to identify both treatment and sample counterfactuals, whether through the design stage of the study or through assumptions on the distribution of these potential outcomes (Rubin, 2011). Under treatment randomization and equal probability sampling, the potential outcomes Y(W, Z) are statistically independent of the treatment and sample selection variables, W and Z, respectively. Under treatment randomization, Y ⊥ W, so that E(Y(w, z)) = E(Y(w, z) | W=w) for w, z ∈ {0,1} and the distribution of the unobservable treatment counterfactuals is equivalent to that of the realized potential outcomes (Rubin, 1974; Imai et al., 2008). Similarly, under equal probability sampling, Y ⊥ Z and E(Y(w, z)) = E(Y(w, z) | Z=z) for w, z ∈ {0,1}, so that the unobservable sample counterfactuals are equal in distribution to the observable potential outcomes among units selected into the sample (that is, Z=1).
In this case, the treatment effects can be estimated model-free and any unbiased estimator of the SATE will also be unbiased for the PATE.
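This unbiasedness claim can be checked by simulation. The sketch below is entirely hypothetical (the population size, outcome probabilities, and sampling rates are illustrative, not taken from the Indiana CRT): it repeats equal probability sampling followed by treatment randomization many times and shows that the difference-in-means SATE estimator is centered on the PATE:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population with heterogeneous school-level treatment effects.
N = 5000
y0 = rng.binomial(1, 0.4, N)                  # Y(0)
y1 = np.maximum(y0, rng.binomial(1, 0.3, N))  # Y(1): treatment never hurts here
pate = np.mean(y1 - y0)

# Repeat the design: equal probability sampling, then randomization in the sample.
reps = 2000
estimates = np.empty(reps)
for r in range(reps):
    z = rng.binomial(1, 0.2, N).astype(bool)  # random sample selection
    w = rng.binomial(1, 0.5, N).astype(bool)  # random treatment assignment
    estimates[r] = y1[z & w].mean() - y0[z & ~w].mean()

# The difference-in-means SATE estimator is centered on the PATE.
bias = estimates.mean() - pate
print(f"PATE = {pate:.3f}, mean estimate = {estimates.mean():.3f}, bias = {bias:+.4f}")
```

The generalization problem arises precisely because the `z = rng.binomial(...)` step is replaced by self-selection in studies like the Indiana CRT.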

Role of Assumptions

The challenge in generalization studies, as illustrated by the Indiana CRT, is that while treatment is assigned randomly, the experimental sample is not randomly selected, so that point identification of the PATE is infeasible solely due to the unobservable sample counterfactuals. Researchers then typically invoke assumptions such as strong ignorability of sample selection in order to account for the unknown potential outcomes. Because the Indiana CRT was a randomized experiment, we focus specifically on strong ignorability of sample selection henceforth. Under strong ignorability, given a vector of observable covariates X for all schools in P, Y ⊥ Z | X, and since the conditional distribution of the potential outcomes is independent of the sample selection indicator Z, we have (Tipton, 2012):

E(Y(w, z) | W=w, Z, X) = E(Y(w, z) | W=w, X) for w, z ∈ {0,1}   (5)

A main stipulation of strong ignorability is that all the covariates that affect sample selection and moderate treatment effects are captured by X, so that, conditional on X, the selection process can be considered random (that is, a probability sample). As a result, the sample counterfactuals are conditionally equal in distribution to the realized potential outcomes, the PATE is fully identified, and bias-reduced estimates of the PATE can be derived. Information on the covariates X is typically sourced from population data frames such as the Common Core of Data (CCD) or state longitudinal data systems (Stuart et al., 2011; Tipton, 2012). Recent methods to improve generalizability use propensity scores to meet the strong ignorability assumption (Rosenbaum and Rubin, 1983; Stuart et al., 2011; Tipton, 2012). Propensity scores, which we denote by s(x), model the probability of sample membership as a function of X and have the advantage of being balancing scores, where matching on the propensity score is equivalent to matching on the covariates in the propensity score model

(Rosenbaum and Rubin, 1983). Under strong ignorability using propensity scores, E(Y(w, z) | W=w, Z, s(x)) = E(Y(w, z) | W=w, s(x)) for w, z ∈ {0,1} and s(x) ∈ (0,1] (Tipton, 2012). If strong ignorability using propensity scores holds, the resulting inferences are made using the conditional distribution of the potential outcomes. Whether the strong ignorability assumption is credible and plausible in practice is a controversial topic. At the heart of the matter, strong ignorability is an invariance assumption: it asserts that the effect of the assessment system is, on average, the same (invariant) for students regardless of whether their school volunteered for the Indiana CRT, once the propensity scores are taken into account. In other words, if strong ignorability holds, self-selection does not matter because any differences between the volunteer schools and the rest of the Indiana population schools are explained by the propensity scores. Is this assumption credible? Conceivably, strong ignorability may not hold for a few reasons. For example, schools that respond differently to the assessment system may have a strong support base from parents, which may not be observable in the population data frames. Some schools that chose not to volunteer for the Indiana CRT may have had an assessment system already in place, and the impact of the CRT system may differ for these schools compared to schools that lacked such a system or such resources to begin with. If these characteristics of schools are not observed in X, strong ignorability does not hold. Note that the concern is not solely the inability to observe these characteristics, but that these potential covariates may explain treatment variability while being excluded from the propensity score model. In addition, strong ignorability would also not hold if there are schools in P whose propensity score s(x) = 0.
This refers to any school whose probability of selection into the sample is structurally zero.
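To make the structural-zero problem concrete, the sketch below uses an entirely hypothetical population frame (the covariate, strata, and selection probabilities are invented for illustration) to estimate stratum-level propensity scores s(x) = P(Z=1 | X=x) and flag a stratum from which no school can volunteer:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population frame: one discrete covariate (say, urbanicity 0/1/2),
# with selection probabilities that depend on the covariate.
N = 3000
x = rng.integers(0, 3, N)
sel_prob = np.array([0.15, 0.05, 0.0])[x]  # stratum 2 never volunteers
z = rng.random(N) < sel_prob               # Z: realized sample selection

# Stratum-level propensity scores s(x) = P(Z=1 | X=x) estimated from the frame.
s_hat = {stratum: z[x == stratum].mean() for stratum in range(3)}
for stratum, s in s_hat.items():
    note = "  <- structural zero: strong ignorability cannot hold" if s == 0 else ""
    print(f"stratum {stratum}: s(x) = {s:.3f}{note}")
```

In practice s(x) is typically estimated with a selection model (for example, logistic regression) on many covariates, but the stratum-level version makes the s(x) = 0 failure of overlap easy to see.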

This may occur if the sample consisted entirely of single-gender schools and generalization to a population of co-educational schools was of primary interest.

Partial Identification of the PATE

Examples in which strong ignorability does not apply, as described in the previous section, are not uncommon. Manski (2009) recommended that researchers begin their analyses by considering what can be learned from the data alone, without any assumptions, so that a domain of consensus is the established starting point. In the absence of strong ignorability, weaker but plausible assumptions that do not fully identify the PATE can yield informative bounds, so that inferences are not based solely on untestable claims about the data (Manski, 2009). In the following sections, we set aside the strong ignorability assumption and consider the inferences that can be made on the partially identified PATE. Although strong ignorability is omitted, the interval estimates are derived under some preliminary assumptions. First, the stable unit treatment value assumption (SUTVA) and its extension to causal generalization are assumed (Tipton, 2012). A key stipulation of SUTVA is that the potential outcome Y(W, Z) of each school i depends only on the treatment and sample indicators of i and not on those of another school j, where i ≠ j. Second, the bounds on the PATE are estimated assuming perfect compliance among the schools in the experimental study. While neither assumption is modest, neither alone is sufficient to point identify the PATE, and relaxing them would require additional assumptions on the potential outcomes Y(W, Z) for each unit in P. Assuming SUTVA and perfect compliance, the bounds for the PATE are derived under four cases. The first two make no assumptions on the data generation process, while the third and fourth introduce assumptions that are weaker than strong ignorability to achieve tighter bounds.
We provide a roadmap of the cases and their respective sections of this article in Table (1). Three important questions are of interest: first, how different are the inferences on the PATE based on strong ignorability compared to the inferences based on the data itself, and are the latter inferences informative; second, what inferences can be made about the PATE under assumptions that are weaker than strong ignorability; and lastly, how can additional sources of information be used to tighten the bounds on the PATE?

INSERT TABLE 1 ABOUT HERE

Two Frameworks for Estimating Bounds on the PATE

Because problems of generalizability concern inferences from a sample to a population, a population frame is required in order to estimate the propensity score of being selected into the experimental sample for every school i in P (Shadish, 2010). The population data frame used in the Indiana CRT, sourced from the CCD, contains demographic information on students and schools as well as test scores over several years. Since these data frames enumerate all schools in P, they provide information on the sample counterfactuals in the decompositions (3) and (4). While propensity score methods use the population data to model the selection probability, we propose using the data frame to present two frameworks for estimating bounds on the PATE. We define two frameworks, the full interval framework and the reduced interval framework, which differ in the extent to which the population data frame provides information to tighten the bounds. The full interval framework uses the experimental sample data with no assumptions on the sample counterfactuals. Under this framework, the only observable quantities are the realized potential outcomes E(Y(w) | W=w, Z=1) for w ∈ {0,1}, while the unobservable treatment and sample counterfactuals are replaced by known bounds on the outcome. The reduced interval framework uses the empirical evidence from the experimental sample and the population data frame (that is, both the study data and the population data from

the CCD) to identify the sample counterfactual E(Y(0) | W=0, Z=0), the expected outcome under the control condition for schools that were assigned control (W=0) and were not selected into the experimental sample (Z=0). A rationale for the reduced interval framework lies in the idea that the control condition in educational experiments may be a business-as-usual (BaU) condition, in which control schools continue implementing existing curricula or programs. The reduced interval framework considers the distribution of potential outcomes under control among schools not selected into the experiment (that is, Z=0) to be identified by the population data frame when the control condition is BaU. Because this was the case for the Indiana CRT, we argue that the non-experimental schools in the population were similarly exposed to the control condition, so that their potential outcomes under control are identified by the population data frame. We compare the widths and magnitudes of the estimated bounds of the PATE under these two frameworks to assess the identifying power of the population data frame on the interval estimates. Note that the reduced interval framework requires additional information, specifically the probabilities P(W=1 | Z=0), P(W=0 | Z=0), and P(Z=0). These three probabilities are all functions of P(Z=0), which is estimated by the proportion of the population not selected into the experimental sample. In practice, this probability can be estimated empirically if knowledge of the study's budget constraints is available. For example, principal investigators decide the scale of implementation within the funding parameters of the experimental study, which informs empirical estimates of P(Z=0). The quantities P(W=1 | Z=0) and P(W=0 | Z=0) represent the probabilities of treatment assignment among non-selected schools, which are typically close to 0.5 in randomized experiments.

Bounds Without Assumptions on the Data Generation Process

Using the two frameworks for bounds, we begin with a domain of consensus and consider the inferences that can be made absent any additional assumptions on the data. The bounds in the following sections are derived for a binary Y, so that the potential outcomes Y(W, Z) share the same lower and upper bounds, {0, 1}, for all units in P, and the expectations E(Y(W, Z) | W, Z) become the probabilities P(Y(W, Z)=1 | W, Z). These bounds can easily be extended to a bounded continuous Y, such as test scores, where the lower and upper bounds of Y are used in place of {0, 1}. Cases in which Y is bounded on one side but unbounded on the other are discussed in Manski (2009), though the focus there is on experimental studies rather than causal generalization.

Full Interval Framework

Using (3) and (4), the lower and upper bounds on the potential outcomes are derived by replacing the unobservable treatment and sample counterfactuals with 0 and 1, respectively. For w ∈ {0,1}, the bounds under the full interval framework are

E(Y(w)) ∈ [E_L(Y(w)), E_U(Y(w))]   (5)
E_L(Y(w)) = E(Y(w) | W=w, Z=1) * P(W=w, Z=1)
E_U(Y(w)) = E_L(Y(w)) + (1 - P(W=w, Z=1))

The lower and upper bounds on the PATE are given by the differences

PATE_L = E_L(Y(1)) - E_U(Y(0))   (6)
PATE_U = E_U(Y(1)) - E_L(Y(0))

The width of this bound is 1 + P(Z=0), and because its value is always at least 1, the sign of the PATE cannot be identified. Note that when P(Z=0) = 0 and all units are selected into the experimental sample, the width 1 + P(Z=0) shrinks to 1, coinciding with the tightest possible width of the interval for the SATE under the same no-assumptions framework.

Reduced Interval Framework

When the population data frame identifies the sample counterfactual E(Y(0) | W=0, Z=0), no replacement is made for this potential outcome in the estimation of the bounds. Because this

sample counterfactual pertains to the expected outcome under control, the lower and upper bounds for E(Y(1)) remain the same, but the bounds for E(Y(0)) become

E(Y(0)) ∈ [E_L(Y(0)), E_U(Y(0))] (7)
E_L(Y(0)) = E(Y(0) | W=0, Z=1) * P(W=0, Z=1) + E(Y(0) | W=0, Z=0) * P(W=0, Z=0)
E_U(Y(0)) = E_L(Y(0)) + (1 - P(W=0, Z=1) - P(W=0, Z=0))

The bounds for the PATE are again given by (6), but now with the lower and upper bounds of E(Y(0)) given in (7). Note that the width of the bound for the PATE is now 1 + P(W=1, Z=0), which is at most the width 1 + P(Z=0) for the full interval framework. Here, P(W=1, Z=0) is the probability that a school not selected into the experiment is assigned to the treatment condition. In the best-case scenario, when P(W=1, Z=0) = 0, the population data frame fully identifies the sample counterfactuals and the bounds for the PATE again shrink to width 1. In the worst case, the reduced interval bounds are equivalent to those under the full interval framework.

Bounds Under Random Treatment Assignment

The bounds in (5) and (7) can be improved with randomized treatment. In randomized experiments like the Indiana CRT, the outcomes Y(W, Z) are statistically independent of the treatment assigned, so that E(Y(w) | Z=1) = E(Y(w) | W=w, Z=1). As a result, the treatment counterfactuals E(Y(w) | W≠w, Z=1) for w ∈ {0, 1} are equal in distribution to the realized potential outcomes, and substitutions are only required for the sample counterfactuals. The bounds are now

E(Y(w)) ∈ [E_L(Y(w)), E_U(Y(w))] (8)
E_L(Y(w)) = E(Y(w) | W=w, Z=1) * P(Z=1)
E_U(Y(w)) = E_L(Y(w)) + (1 - P(Z=1))

for the full interval framework, but the width of (8) is now 2 * P(Z=0), which is strictly tighter than the width of (5) except when P(Z=0) = 1. Under the reduced interval framework, the bounds for E(Y(0)) are given by

E(Y(0)) ∈ [E_L(Y(0)), E_U(Y(0))] (9)
E_L(Y(0)) = E(Y(0) | W=0, Z=1) * P(Z=1) + E(Y(0) | W=0, Z=0) * P(W=0, Z=0)
E_U(Y(0)) = E_L(Y(0)) + (1 - P(Z=1) - P(W=0, Z=0))

Here, the width of the bound is P(Z=0) + P(W=1, Z=0), which is smaller than all of the bounds given in the previous section and is at most equal to the width for the full interval case with treatment randomization. Note that the bounds under treatment randomization are narrower than their respective counterparts under the first case, and the reduced interval bound is again potentially tighter when E(Y(0) | W=0, Z=0) is identified by the population data frame.

Bounded Sample Variation and Treatment Randomization

The bounds under no assumptions and treatment randomization are often referred to as worst-case bounds; they are conservative and often uninformative about the sign of the PATE (Manski, 2009). The worst-case bounds represent the simplest bounds based on the data, but they can be tightened by adding assumptions. If strong ignorability is unlikely to hold in practice, are there weaker assumptions that may be more plausible? Weaker assumptions are those that are insufficient to point identify the PATE. To conceptualize weaker assumptions, it is helpful to return to the condition stipulated under strong ignorability: E(Y(w) | W=w, Z, s(x)) = E(Y(w) | W=w, s(x)), which implies that the distributions of potential outcomes for the volunteer and non-volunteer schools are conditionally equal. In other words, the difference between the unobserved sample counterfactuals and the realized potential outcomes from the sample is exactly zero, on average, when conditioning on the estimated propensity scores s(x). Rather than set the difference to be exactly zero, a simple weakening of this invariance assumption is to restrict the difference to be at most a specified constant (Manski, 2015).
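As a concrete sketch, the worst-case bounds under randomized treatment in (8) and (9) can be computed directly from a handful of estimated proportions. The outcome proportions below are illustrative placeholders, not Indiana CRT estimates.

```python
def full_interval_bounds(ey1_s, ey0_s, p_z1):
    """Bounds (8): E_L(Y(w)) = E(Y(w)|W=w,Z=1)*P(Z=1) and
    E_U(Y(w)) = E_L(Y(w)) + (1 - P(Z=1)), for binary Y."""
    lo1 = ey1_s * p_z1
    hi1 = lo1 + (1 - p_z1)
    lo0 = ey0_s * p_z1
    hi0 = lo0 + (1 - p_z1)
    # PATE bounds from (6): difference of opposite-end bounds.
    return lo1 - hi0, hi1 - lo0

def reduced_interval_bounds(ey1_s, ey0_s, ey0_pop, p_z1, p_w0_z0):
    """Bounds (9): the population data frame identifies E(Y(0)|W=0,Z=0),
    so only the P(W=1,Z=0) mass is replaced for Y(0)."""
    lo1 = ey1_s * p_z1
    hi1 = lo1 + (1 - p_z1)
    lo0 = ey0_s * p_z1 + ey0_pop * p_w0_z0
    hi0 = lo0 + (1 - p_z1 - p_w0_z0)
    return lo1 - hi0, hi1 - lo0

# Width checks: full width = 2*P(Z=0); reduced = P(Z=0) + P(W=1,Z=0).
p_z1, p_w0_z0 = 0.05, 0.475                      # hypothetical values
lo_f, hi_f = full_interval_bounds(0.6, 0.5, p_z1)
lo_r, hi_r = reduced_interval_bounds(0.6, 0.5, 0.55, p_z1, p_w0_z0)
print(round(hi_f - lo_f, 3))                     # 2 * 0.95 = 1.9
print(round(hi_r - lo_r, 3))                     # 0.95 + 0.475 = 1.425
```

The reduced interval is nested inside the full interval for any admissible inputs, which is the sense in which the population data frame contributes identifying power.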
In particular, a researcher may not be confident that the conditional distributions of ISTEP+ scores are exactly equal, but they

may be willing to believe that the distributions are sufficiently similar among the volunteer and non-volunteer schools in the experimental study. We quantify this similarity by extending the randomized treatment framework under a bounded sample variation assumption, given as follows:

|E(Y(w) | W=w, Z=0) - E(Y(w) | W=w, Z=1)| ≤ λ
|E(Y(w) | W≠w, Z=0) - E(Y(w) | W≠w, Z=1)| ≤ λ for w ∈ {0, 1} (10)

where λ ∈ [0, 1] is a constant that represents the largest magnitude of the difference (Manski, 2015). Note that the condition in (10) is applied to both sets of sample counterfactuals. By design, bounded sample variation yields interval estimates of the PATE whose widths depend on λ. Bounded variation assumptions were first introduced in Manski (2015) and are discussed in Manski & Pepper (2015) with an application to the impact of right-to-carry laws on crime rates. For the Indiana CRT, if bounded sample variation holds, the proportion of Pass schools differs by at most a constant λ between the volunteer and non-volunteer schools for each respective treatment condition. Bounded sample variation yields new bounds of the PATE given by

E_L(Y(w)) = E(Y(w) | W=w, Z=1) * P(Z=1) + (E(Y(w) | W=w, Z=1) - λ) * P(Z=0)
E_U(Y(w)) = E(Y(w) | W=w, Z=1) * P(Z=1) + (E(Y(w) | W=w, Z=1) + λ) * P(Z=0) (11)

for the full interval framework and

E_L(Y(0)) = E(Y(0) | W=0, Z=1) * P(Z=1) + E(Y(0) | W=0, Z=0) * p + (E(Y(0) | W=0, Z=1) - λ) * (1 - P(Z=1) - p)
E_U(Y(0)) = E(Y(0) | W=0, Z=1) * P(Z=1) + E(Y(0) | W=0, Z=0) * p + (E(Y(0) | W=0, Z=1) + λ) * (1 - P(Z=1) - p) (12)

for the reduced interval framework, where p = P(W=0, Z=0). Like the previous frameworks, the lower and upper bounds of the PATE are given by (6). Because Y is binary, the constant λ is constrained to the interval [0, 1], as it represents a difference in estimated proportions. However, λ can be any positive constant when Y is

continuous. When λ = 0, the bounds reduce to a single value of the PATE, so that point estimation is recovered. Larger values of λ imply larger differences between the expected potential outcomes of schools self-selecting into the experimental study, thus widening the interval estimates and weakening the assumption further. Importantly, bounded sample variation and strong ignorability are not nested. Bounded sample variation allows the potential outcomes among sample and non-sample schools to vary, under the assumption that the outcomes are similar in magnitude but not necessarily equal. To conceptualize the plausibility of this assumption, consider it in the context of matching. When experimental samples are matched to a population P in causal generalization, the goal is to achieve balance on observable covariates so that the resulting difference in distributions is minimized (see, e.g., Hansen (2004) for examples of matching). Here, balance is quantified as attaining the smallest standardized mean difference in covariates between the two groups. Conceivably, the difference in expected potential outcomes is smaller with matched samples if the expected outcomes Y are a function of the covariates that are balanced between the groups. In contrast to strong ignorability, the plausibility of bounded sample variation lies in the assumption that there is sufficient overlap between the distributions of potential outcomes among self-selected and population schools to facilitate the derivation of informative bounds. For the Indiana CRT, Tipton et al. (2016) found that the experimental sample was actually similar to the population P despite the volunteer nature of the sampling. Using a simulation study, a comparison of the distribution of SMDs with simple random samples showed that the balance statistics of the CRT sample were not significantly different from those found for probability samples. Since the experimental sample is not very different from the

population, bounded sample variation is not implausible, and we explore the inferences under this assumption in a direct application with the experimental study.

Choice of λ

The plausibility and applicability of bounded sample variation rely on the choice of λ. Previous work on bounded variation assumptions estimates λ based on prior outcome data (Manski & Pepper, 2015). However, empirical estimates of λ are difficult in our example, as λ represents the difference between the volunteer and non-volunteer schools' outcomes in the Indiana CRT, which is specific to this study. Because bounded sample variation is related to the goals of matching methods, a natural choice for the parameter λ is the standardized mean difference (SMD). In particular, let λ = |X̄_P - X̄_S| / σ_P, where X is a pre-treatment covariate, X̄_P is the mean covariate value for schools in the population, X̄_S is the mean covariate value for the sample schools, and σ_P is the standard deviation of X across all schools in the population. This choice reflects the assumption that λ effectively measures the degree to which the sample and population are balanced on the specific covariate. As a result, larger values of the SMD suggest larger differences between the sample and population distributions, which are then reflected in wider intervals. To illustrate, when Y is a test score outcome, one possible choice for X is a pretest score covariate that is strongly correlated with the outcome. Alternatively, because the SMD is typically estimated for multiple covariates, λ may instead be based on the average SMD of several covariates, or on the maximum SMD as a conservative estimate. Although bounded sample variation is not necessarily validated by these estimates of λ, we provide these choices as a starting point for applications of this assumption.
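The SMD-based choice of λ and the resulting full-interval bounds in (11) can be sketched as follows. All inputs are hypothetical illustrations, not Indiana CRT quantities.

```python
import statistics

def smd_lambda(x_pop, x_sample):
    # lambda = |mean(X_P) - mean(X_S)| / sd(X_P): the standardized mean
    # difference of a pre-treatment covariate X between population and sample.
    return abs(statistics.mean(x_pop) - statistics.mean(x_sample)) / statistics.pstdev(x_pop)

def bounded_variation_pate(ey1_s, ey0_s, p_z1, lam):
    # Full-interval bounds (11): the unobserved non-sample mean of Y(w)
    # is assumed to lie within lam of the corresponding sample mean.
    def lo(e):
        return e * p_z1 + (e - lam) * (1 - p_z1)
    def hi(e):
        return e * p_z1 + (e + lam) * (1 - p_z1)
    # PATE bounds from (6): difference of opposite-end bounds.
    return lo(ey1_s) - hi(ey0_s), hi(ey1_s) - lo(ey0_s)

# Hypothetical covariate values for the SMD-based choice of lambda:
lam = smd_lambda([1, 2, 3, 4], [3, 4])
print(bounded_variation_pate(0.6, 0.5, 0.05, lam))
```

Note that with λ = 0 the interval collapses to the point estimate, and for binary Y the interval width grows as 4λ * P(Z=0), which makes the sensitivity of the bounds to the chosen λ explicit.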
Another choice for λ is based on the actual variation of the realized potential outcomes in the experimental study, specifically by letting λ = 2 * √(Var(Y(w, z) | W=w, Z=1)). This value of λ is

analogous to the margin of error in the construction of confidence intervals for normally distributed data. When Y(w, z) is binary, the variance is a function of the proportion of successes in the sample (that is, the proportion of Pass schools). This choice of λ offers a conservative limit on the difference in E(Y) by restricting the difference to be no more than two standard deviations of the empirical distribution of outcomes. It is made assuming that there is sufficient overlap between the two distributions of potential outcomes.

Monotone Treatment Response

Bounded sample variation and strong ignorability focus on the average distance between distributions of potential outcomes, but neither assumption considers the actual shape of the response function Y(W, Z). If the principal investigators in the Indiana CRT were confident that the benchmark assessment system improves student outcomes, this leads to a new assumption with a different type of identifying power. In particular, researchers may be willing to assume that the response function Y(W, Z) varies monotonically with the magnitude of the treatment. The concept of monotonicity is related to the idea that the benchmark assessment system at least does no harm and, at worst, produces outcomes that are not statistically different from the BaU condition. Because monotonicity concerns the response function, it is not nested within the strong ignorability assumption. Unlike strong ignorability and bounded sample variation, monotonicity stipulates that the response Y(W, Z) is weakly increasing in the treatment, regardless of self-selection. Although monotonicity cannot be validated empirically, evidence from pilot studies may suggest its plausibility. Pilot studies are often used to determine the feasibility of an experimental study, but also to assess the expected magnitude and sign of the treatment effect in the first iteration of the experiment.
During the pilot year of the CRT, Konstantopoulos et al. (2013)

found positive, though insignificant, treatment effects for a majority of the grades among the experimental schools. These findings were based largely on the unconditional model without covariate adjustments, and the treatment effects were estimated for both ELA and mathematics. We discuss the unconditional model because the interval estimates in this article are based on nonparametric bounding methods. While the pilot study results do not guarantee monotonicity's validity, they do lend support to its plausibility for the specific time period of the pilot program. If monotonicity holds, this assumption, combined with the assumptions of SUTVA and perfect compliance, yields a new set of bounds of the PATE with analogous extensions to the bounds of the prior frameworks. Monotonicity of the response function Y(W) stipulates the following condition:

W ≥ W′ implies Y_i(W) ≥ Y_i(W′) for i = 1, …, N

where W, W′ represent two treatment conditions (Manski, 1997). The monotonicity framework differs from the previous cases in two important respects. First, the treatment indicator W now denotes an ordered set of treatments, so that the distribution of values of Y(W) is at least as large as that of Y(W′) if W ≥ W′. For the Indiana CRT, with W=1 and W′=0, this implies that the average ISTEP+ scores among the treatment schools (W=1) are at least as large as those among the control schools (W′=0) if monotonicity holds. Second, the bounds under monotone treatment response (MTR) are derived using realized values of Y(W, Z) for each combination of (W, Z) across all units in the study. The original monotonicity framework proposed in Manski (1997) assumed that outcomes at different levels of the treatment were observable, which contributes identifying power even if some levels of the treatment are not realized. The sharpness of the bounds is a consequence of the monotonic nature of the response function. Although the focus here is on

weakly increasing response functions Y(W, Z), the results can easily be generalized to weakly decreasing functions (Manski, 2009). If Y(W, Z) is weakly increasing in W, the lower bound of the PATE is zero by design. For binary Y and W, the upper bound is a function of the proportions of successes and non-successes among the treatment and control schools. Formally, the upper bound PATE_U is

PATE_U = P(Y=0 | W=0, Z=1) * P(W=0, Z=1) + P(Y=1 | W=1, Z=1) * P(W=1, Z=1) + P(Y=0 | W=0, Z=0) * P(W=0, Z=0) + P(Y=1 | W=1, Z=0) * P(W=1, Z=0) (13)

Note that the upper bound is the sum of two components. The first, given by P(Y=0 | W=0, Z=1) * P(W=0, Z=1) + P(Y=1 | W=1, Z=1) * P(W=1, Z=1), is the sum of the proportion of non-Pass schools (Y=0) among the schools assigned to control (W=0) and the proportion of Pass schools (Y=1) among the schools assigned to treatment (W=1), for schools that were selected into the experimental study (Z=1). If Y is truly monotone in the treatment, the largest feasible upper bound is the sum of the upper bound for Y(1) and the lower bound for Y(0). The second component, P(Y=0 | W=0, Z=0) * P(W=0, Z=0) + P(Y=1 | W=1, Z=0) * P(W=1, Z=0), is the sum of analogous proportions among the population schools (Z=0). Because MTR is based on realized values Y(W, Z), additional consideration must be given to the upper bound, as it is a function of schools not selected into the experimental sample (Z=0). If the population data frame contributes identifying power, as proposed under the reduced interval framework, the term P(Y=0 | W=0, Z=0) is identified from the empirical evidence. However, the proportion P(Y=1 | W=1, Z=0) is still an unobservable sample counterfactual, and the only information that contributes identifying power is the known bounds {0, 1} for this proportion. Because the upper bound depends on this sample counterfactual, an interval of values for PATE_U can be derived by substituting 0 (1) for the minimum (maximum)

upper PATE bound. For the purpose of comparing the MTR assumption with the previous frameworks, we focus on the minimum of PATE_U, as it is derived using the realized values Y(W, Z) from both the sample and the population data. This lower bound on PATE_U is a best-case bound because it represents the smallest value PATE_U takes under the monotonicity framework. Note that the smallest PATE_U is derived by substituting 0 for the unobserved sample counterfactuals and the largest PATE_U by substituting the value 1. Like the full and reduced interval cases, two MTR bounds, denoted the sample MTR bound and the population MTR bound, are presented and given by

Sample MTR Bound:
PATE_U = P(Y=0 | W=0, Z=1) * P(W=0, Z=1) + P(Y=1 | W=1, Z=1) * P(W=1, Z=1)

Population MTR Bound:
PATE_U = P(Y=0 | W=0, Z=1) * P(W=0, Z=1) + P(Y=1 | W=1, Z=1) * P(W=1, Z=1) + P(Y=0 | W=0, Z=0) * P(W=0, Z=0) (14)

Note that the bounds in (14) differ by the term P(Y=0 | W=0, Z=0) * P(W=0, Z=0), which is identified by the population data frame in the population MTR bound. Thus, if monotonicity holds and the treatment is believed to at least do no harm, the interval estimate of the PATE shrinks to lie on one side of zero.

Bounds by Propensity Score Stratum

The bounds developed thus far are estimated based on the entire experimental sample and population data frame (the latter for the reduced interval and population MTR cases). In generalizations with nonrandom samples, a primary goal is to match the sample and population. Subclassification is a common matching method in which the population is partitioned into smaller subclasses or strata using quantiles of the propensity score distribution (Tipton, 2012). Figure 1 shows the distribution of propensity score logits, logit(s(x)) = log(s(x)/(1 - s(x))), for the Indiana CRT, where s(x) denotes the propensity score. For this study, the sample only permitted three equally sized strata. These strata represent coarse matches between the volunteer and non-volunteer schools based on the covariates used to estimate the propensity scores. As a statistical tool, subclassification shares the advantages of stratification methods, improving the precision of estimates when schools in the same stratum are more similar on the matched covariates than schools in different strata.

INSERT FIGURE 1 ABOUT HERE

Because the bounds are estimated nonparametrically from the data, they can also be computed for subgroups of the data to derive stratum-specific interval estimates. In our final application, we propose a combined approach in which the previous no-assumption, bounded sample variation, and monotonicity frameworks are applied with subclassification to derive stratum-specific interval estimates of the PATE. Note that strong ignorability is not invoked, as the extended application is based on subclassification as a matching method. Given k propensity score strata, the bounds [PATE_Lj, PATE_Uj] are now estimated using the empirical distribution P(Y_j | W_j, Z_j), for j = 1, …, k. Because the outcome Y is bounded by the same parameters in each stratum, the interval estimates have the same form as the bounds based on the original sample. Analogous arguments for the plausibility of the bounded sample variation and monotonicity frameworks extend to the propensity strata. Conceivably, bounded sample variation assumptions may be more plausible here, as the interval estimates are now based on matched subgroups with improved overlap in the covariate distributions within strata. Alternatively, when subclassification is used (though not necessarily for generalization), this combined approach is useful in deriving preliminary bounds for sparse strata.
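A minimal sketch of the combined approach, applying the randomized-treatment full-interval bounds separately within each propensity score stratum; the stratum summaries below are hypothetical placeholders, not the Indiana CRT estimates.

```python
def stratum_pate_bounds(strata):
    # Each stratum j supplies (E(Y(1)|W=1,Z=1), E(Y(0)|W=0,Z=1), P(Z=1)),
    # all estimated within the stratum; returns (PATE_L, PATE_U) per stratum.
    bounds = {}
    for j, (ey1, ey0, p_z1) in strata.items():
        lo1, lo0 = ey1 * p_z1, ey0 * p_z1
        hi1, hi0 = lo1 + (1 - p_z1), lo0 + (1 - p_z1)
        bounds[j] = (lo1 - hi0, hi1 - lo0)
    return bounds

# Hypothetical stratum summaries:
example = {1: (0.6, 0.5, 0.30), 2: (0.4, 0.5, 0.10), 3: (0.7, 0.2, 0.02)}
for j, (lo, hi) in stratum_pate_bounds(example).items():
    # Within each stratum the width is 2 * (1 - P(Z=1)), so sparse strata
    # with small selection probabilities yield wide, cautious intervals.
    print(j, round(lo, 3), round(hi, 3))
```

The per-stratum computation makes explicit why sparse strata produce wide intervals: a small within-stratum selection probability leaves most of the outcome mass unidentified.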
Sparse strata, like Stratum 3 in Figure 1, represent regions of the distribution with poor overlap, so that fewer

sample schools are matched to population schools. The small stratum-specific sample sizes present challenges for design-based estimation, as the limited information leads to imprecise estimates. In these cases, inferences based on the interval estimates serve as a starting point for the analysis, and they may be more credible because the limited data from sparse strata require additional assumptions for point estimation. Furthermore, the nonparametric method of deriving the bounds offers a flexibility that easily extends to any number of subclasses. Note that propensity score subclassification depends on the propensity score model and the covariates used to estimate the propensity scores, so the choice of covariates that are correlated with sample selection and treatment heterogeneity is important.

Bounds for the PATE in the Indiana CRT

Tables 2 and 3 provide the bounds for the four cases based on the experimental and stratified Indiana CRT samples, respectively. The PATE is defined as P(Y(1)=1) - P(Y(0)=1), the difference in expected proportions of Pass schools among the 34 schools that implemented the assessment system (assigned treatment) and the 22 schools that did not (assigned control). To illustrate this application, 4th grade scores were used, so that the population P was redefined to include the N = 1,029 fourth-grade-serving schools from the original 1,514 K-8 schools. Using the experimental sample, the conditional probabilities of treatment assignment were estimated as P(W=1 | Z=1) = 34/56 = 0.61 and P(W=0 | Z=1) = 22/56 = 0.39. With the redefined P, the probability of selection into the sample is given by P(Z=1) = 56/1029 ≈ 0.05 and P(Z=0) = 1 - P(Z=1). For the reduced interval framework, we derive the bounds using P(W=0, Z=0) = 0.5 * P(Z=0) to illustrate the potential of the population data to tighten bounds. This choice of P(W=0, Z=0) implies randomized treatment among schools in the population.
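The design quantities just described imply the worst-case interval widths directly; the short arithmetic sketch below reproduces the width formulas 1 + P(Z=0), 2 * P(Z=0), and P(Z=0) + P(W=1, Z=0) at the Indiana CRT sample sizes.

```python
# Interval widths implied by the Indiana CRT design quantities quoted above.
n_sample, n_pop = 56, 1029
p_z1 = n_sample / n_pop              # probability of selection, about 0.05
p_z0 = 1 - p_z1

width_no_assumptions = 1 + p_z0      # full interval, no assumptions (5)-(6)
width_randomized = 2 * p_z0          # full interval with randomization (8)
width_reduced = p_z0 + 0.5 * p_z0    # reduced interval, P(W=1,Z=0) = 0.5*P(Z=0)

print(round(width_no_assumptions, 2),
      round(width_randomized, 2),
      round(width_reduced, 2))       # → 1.95 1.89 1.42
```

These widths match the reported intervals (e.g., [-0.93, 0.96] has width 1.89 and [-0.87, 0.55] has width 1.42), which illustrates how little the worst-case frameworks can narrow the interval when P(Z=1) is small.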

Because the Indiana CRT was a randomized experiment, a natural starting place is the bounds under randomized treatment, derived without additional assumptions beyond SUTVA and perfect compliance. As shown in Table 2, the bounds are distinctly uninformative, with the interval nearly spanning the [-1, 1] range for both subjects, so that the sign of the PATE cannot be identified. Although the bounds under randomized treatment improve on the no-assumptions bounds, with the interval shrinking from [-0.97, 0.98] to [-0.93, 0.96] for ELA, the improvement is slight because the estimated probability of sample selection is small, with P(Z=1) = 0.05. When the population data frame identifies P(Y(0) | W=0, Z=0) in the reduced interval framework, the lower bound for P(Y(0)) under the no-assumptions and randomized treatment cases becomes larger because it includes the proportion of Pass schools among those assigned to control in the population. Using P(W=0, Z=0) = 0.5 * P(Z=0), the bounds for Math under the reduced interval framework shrink from [-0.93, 0.96] to [-0.87, 0.55] under randomized treatment, a 25% reduction in width. The tightening of the upper bound from 0.96 to 0.55 is more pronounced, as the difference is taken with a larger lower bound for P(Y(0)). Analogous results hold for ELA. However, the tighter bounds under the reduced interval framework do not rule out an insignificant PATE, as each of the four interval estimates includes zero for both ELA and Math. The bounds under no assumptions illustrate that when P(Z=1) is small, the interval estimates of the PATE are rarely informative, so some assumptions are needed. The third rows for each subject in Table 2 provide the bounds under bounded sample variation. For ELA, the value λ = 0.3 was chosen based on the variance of Y and λ = 0.5 based on the SMD of the pretest scores. The values λ = 0.1 and 0.6 were chosen similarly for Math. For both

subjects, the smaller values of λ represent a smaller difference in the expected potential outcomes between the volunteer and population schools. When λ = 0.3, the expected difference in the proportion of Pass schools ranges from -0.31 to 0.83 for ELA, a much tighter interval than in the randomized treatment case. For Math, the smaller λ value of 0.1 facilitates identification of the sign of the PATE, which ranges from 0.02 to 0.40. Note that these intervals widen substantially when the value of λ increases to 0.5 and 0.6 for ELA and Math, respectively. Since λ affects both the lower and upper bounds of P(Y(1)) and P(Y(0)), the impact on the bounds of the PATE is more pronounced after taking the difference. For the reduced interval framework, the upper bound PATE_U is again much smaller than that of the full interval framework, since the difference P(Y(1))_U - P(Y(0))_L is now taken with a larger P(Y(0))_L. With the exception of Math with λ = 0.1, the interval estimates suggest an insignificant PATE. The bounds for the PATE are starkly different under monotonicity of treatment response. Note that the upper bound PATE_U in Table 2 is the smallest upper bound estimated using the realized potential outcomes. If the benchmark system is assumed to at least do no harm, the expected difference in the proportion of Pass schools is at most 0.02 in both subjects based on the experimental data. This upper bound increases to 0.07 and 0.09 for ELA and Math, respectively, when the upper bound includes P(Y(0)=0 | W=0, Z=0). However, since the difference between 0.02 and 0.07 for ELA is small, P(Y(0)=0 | W=0, Z=0) is small, which implies that the proportion of non-Pass schools in the population is small. Importantly, because PATE_U is a function of P(Z=1), if this proportion is small, the bound PATE_U will take small values.
Although the MTR bounds include zero by design, the magnitude of the smallest PATE_U for both subjects suggests that, using the realized outcomes, large values of the PATE can be ruled out even if small, insignificant treatment effects cannot be excluded.
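The sample and population MTR upper bounds in (14) can be sketched as follows; the cell proportions passed in are hypothetical illustrations, not the Table 2 values.

```python
def mtr_upper_bounds(p_y0_ctrl, p_y1_trt, p_w0_z1, p_w1_z1,
                     p_y0_ctrl_pop=None, p_w0_z0=0.0):
    # Sample MTR bound: uses only the experimental (Z=1) cells.
    sample = p_y0_ctrl * p_w0_z1 + p_y1_trt * p_w1_z1
    # Population MTR bound: adds P(Y=0|W=0,Z=0) * P(W=0,Z=0) when the
    # population data frame identifies the control outcome distribution;
    # the remaining counterfactual P(Y=1|W=1,Z=0) is set to its lower
    # bound of 0, yielding the smallest feasible PATE_U.
    population = sample
    if p_y0_ctrl_pop is not None:
        population += p_y0_ctrl_pop * p_w0_z0
    return sample, population

# Hypothetical cell proportions:
s, p = mtr_upper_bounds(0.1, 0.3, 0.5, 0.5, p_y0_ctrl_pop=0.2, p_w0_z0=0.45)
print(s, p)
```

The gap between the two bounds isolates the identified population term P(Y=0 | W=0, Z=0) * P(W=0, Z=0), which is the comparison made between the sample and population MTR bounds in the text.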

INSERT TABLE 2 ABOUT HERE

Table 3 provides the bounds for the combined bounding and subclassification approach. Note that the stratum-specific sample sizes are smaller under this approach, with Stratum 3 containing only a single experimental school. Under randomized treatment, the stratum-specific bounds for ELA and Math are largely similar to those based on the entire experimental sample. The sign of the PATE is unidentified for both subjects, and the range of values under the full interval case primarily spans the [-1, 1] range. The reduced interval bounds under randomized treatment have widths similar to those for the original sample, but slight differences can be seen among the strata. For example, the difference in the proportion of Pass schools for ELA and Math in Stratum 1 ranges from approximately -0.85 to 0.53 under the reduced interval framework, but the lower bound decreases to about -0.90 in Stratum 2. The differences among bounds are seen more distinctly with bounded sample variation and MTR. Using the same bounding parameters as with the original sample, the interval estimates under bounded sample variation suggest larger differences among the strata. In Math, for example, the sign of the PATE is identified in Stratum 1 when λ = 0.1, with an interval estimate of [0.12, 0.49]. However, the bounds in Stratum 2 imply an insignificant PATE, which illustrates potential differences in inferences among the strata. For MTR, the smallest PATE_U in each stratum has a magnitude similar to that based on the original sample, with the exception of Stratum 3, where the stratum sample is a single school. Like the bounds under bounded sample variation, the small differences in magnitude of the interval values facilitate an analysis of the differences between the sample and population in each subclass.

INSERT TABLE 3 ABOUT HERE

As a comparison, Table 4 gives the point estimates of the PATE under no weighting, inverse propensity weighting (IPW), and subclassification for ELA and Math. The no-weighting case refers to the direct estimate of the PATE that does not use propensity scores. As shown, the point estimates for ELA and Math are all insignificant, a result that is largely consistent with the bounds provided. While the interval estimates for Math show a positive PATE under λ = 0.1 with bounded sample variation, the interval estimate is still consistent; in particular, the lower bound of 0.02 suggests that small, insignificant treatment effects are possible. With the exception of IPW, the point estimates under no weighting and subclassification are similar in magnitude to the smallest PATE_U under MTR, further supporting insignificant results. The magnitudes of the point estimates in Table 4 illustrate the potential for different inferences with different methods. While strong ignorability was not invoked, the interval estimates provided were sufficiently informative on the insignificant PATE, and their consistency with the point estimates supports the potential usefulness of the partially identified estimators.

INSERT TABLE 4 ABOUT HERE

In the original analysis, Konstantopoulos et al. (2013) found significant treatment effects for fourth grade ELA using a two-level hierarchical linear model with covariates. Using standardized continuous ISTEP+ scores, a significant PATE of 0.135 (standard error 0.057) was found based on the experimental sample with a model that included school- and student-level covariates. Importantly, the PATE from the model with treatment alone was not significant (PATE = 0.087, standard error 0.111). Although this point estimate is not directly comparable to the bounds based on binary outcomes provided here, it is important to note that the significance of the estimate and the resulting inferences depended on the choice of model.

Discussion & Conclusion

This paper addresses the differences in inferences when strong ignorability of sample selection is not invoked in generalization problems. Because point estimates are typically desirable in experimental studies, strong ignorability assumptions are often imposed with little discussion of their plausibility. By illustrating the application of bounding methods to causal generalization, we demonstrate the importance of thoughtful consideration of the types of assumptions that may be more plausible in practice. As shown by the differences in widths and magnitudes of the bounds, different assumptions, regardless of whether they have point identifying power, can significantly affect the inferences from a study. A primary goal of partial identification methods is to approach an analysis from a common starting point (Manski, 2009). For that reason, the nonparametric nature of the bounds derived in this paper offers a simple, empirically based approach, as all the bounds are computed from observable and estimable quantities. Because models are not used to estimate the intervals, an advantage of bounding methods is that the results can be replicated by different researchers. Generalization problems incorporate both population and sample units, so the bounds for the PATE necessarily involve the probability of sample selection (denoted by Z). The bounds under randomized treatment with no additional assumptions are uninformative if the population to which the sample is generalized is three to five times larger than the sample. Because this is often the case for generalizations in educational research, assumptions are needed to derive more informative bounds on the PATE. While other assumptions are plausible in different problems, we focus on bounded sample variation and monotonicity, as the former is a natural weakening of the strong ignorability assumption and the latter applies when there is theoretical evidence on the impact of the intervention.
While bounded sample variation and monotonicity are not

validated by our application, we present the interval estimates to illustrate the consistency of their inferences with those of the point estimates under strong ignorability. Partial identification applications in generalizability have the advantage of additional sources of data. The primary motivation of our proposed reduced interval framework is to use additional empirical evidence to tighten the interval estimates in place of solely layering on assumptions. As shown in the example, population data frames have the potential to tighten bounds when information on the probabilities of treatment assignment among the non-selected schools is known. Although information on this probability may be sourced elsewhere, we argue for the usefulness of population data frames in minimizing the number of unobservable potential outcomes. The inclusion of population data was also revealing when comparing the bounds under the sample and population MTR frameworks. As it turns out, the small differences in magnitude of the upper bound PATE_U implied small differences in the proportion of Pass schools among the treated and untreated schools. While interval estimates do not substitute for the point estimates used to inform policy, we present their application as an alternative perspective when assumptions weaker than strong ignorability are made in generalizations. Generalizations with nonprobability samples inevitably involve a discussion of the extent to which a self-selected sample differs from the population. Although the assumptions used in this article make conjectures about this difference, by using partial identification approaches we show that the bounds may still be informative and that a better case can be made for their credibility in practice.

References

Dawid, A. P. (2000). Causal inference without counterfactuals. Journal of the American Statistical Association, 95(450), 407-424.

Greenberg, D., & Shroder, M. (2004). The Digest of Social Experiments (3rd ed.). Washington, DC: The Urban Institute Press.

Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10(1), 37-48.

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609-618.

Heckman, J. J., & Vytlacil, E. J. (2001). Instrumental variables, selection models, and tight bounds on the average treatment effect. In Econometric Evaluation of Labour Market Policies (pp. 1-15). Heidelberg: Physica-Verlag.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945-960.

Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481-502.

Keiding, N., & Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 319-376.

Konstantopoulos, S., Miller, S. R., & van der Ploeg, A. (2013). The impact of Indiana's system of interim assessments on mathematics and reading achievement. Educational Evaluation and Policy Analysis, 35(4), 481-499.

Kruskal, W., & Mosteller, F. (1980). Representative sampling, IV: The history of the concept in statistics, 1895-1939. International Statistical Review/Revue Internationale de Statistique, 169-195.

Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review, 80(2), 319-323.

Manski, C. F. (1997). Monotone treatment response. Econometrica, 65(6), 1311-1334.

Manski, C. F. (2009). Identification for Prediction and Decision. Cambridge, MA: Harvard University Press.

Manski, C. F. (2016). Credible interval estimates for official statistics with survey nonresponse. Journal of Econometrics, 191(2), 293-301.

Manski, C. F., & Pepper, J. V. (2015). How do right-to-carry laws affect crime rates? Coping with ambiguity using bounded-variation assumptions (NBER Working Paper No. 21701). Cambridge, MA: National Bureau of Economic Research.

Olsen, R. B., Orr, L. L., Bell, S. H., & Stuart, E. A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32(1), 107-121.

O'Muircheartaigh, C., & Hedges, L. V. (2014). Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(2), 195-210.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688-701.

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational and Behavioral Statistics, 2(1), 1-26.

Rubin, D. B. (1980). Comment. Journal of the American Statistical Association, 75, 591-593.

Rubin, D. B. (1986). Statistics and causal inference: Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81(396), 961-962.

Rubin, D. B. (2011). Causal inference using potential outcomes. Journal of the American Statistical Association.

Shadish, W. R. (2010). Campbell and Rubin: A primer and comparison of their approaches to causal inference in field settings. Psychological Methods, 15(1), 3-17.

Stuart, E. A., Cole, S. R., Bradshaw, C. P., & Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2), 369-386.

Tipton, E. (2012). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics. Advance online publication.

Tipton, E., Hallberg, K., Hedges, L. V., & Chan, W. (2016). Implications of small samples for generalizations: Adjustments and rules of thumb. Evaluation Review (forthcoming).

Figure 1. Distribution of Propensity Score Logits for Indiana CRT. Note: S1 = Stratum 1; S2 = Stratum 2; S3 = Stratum 3.