arxiv: v1 [stat.me] 21 Feb 2016

Principal stratification in the Twilight Zone: Weakly separated components in finite mixture models

Avi Feller, UC Berkeley
Evan Greif, Harvard University
Luke Miratrix, Harvard University
Natesh Pillai, Harvard University

This version: September 26, 2018

Abstract

Principal stratification is a widely used framework for addressing post-randomization complications in a principled way. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of the mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example: a two-component Gaussian mixture model in which only the component means are unknown. Even though the MLE is asymptotically efficient, we show through extensive simulations that the MLE has undesirable properties in practice. In particular, when mixture components are only weakly separated, we observe pile up, in which the MLE estimates the component means to be equal, even though they are not. We first show that parametric convergence can break down in certain situations. We then derive a simple moment estimator that displays key features of the MLE and use this estimator to approximate the finite sample behavior of the MLE. Finally, we propose a method to generate valid confidence sets via inverting a sequence of tests, and explore the case in which the component variances are unknown. Throughout, we illustrate the main ideas through an application of principal stratification to the evaluation of JOBS II, a job training program.

afeller@berkeley.edu. AF and LM gratefully acknowledge financial support from the Spencer Foundation through a grant entitled "Using Emerging Methods with Existing Data from Multi-site Trials to Learn About and From Variation in Educational Program Effects," and from the Institute for Education Science (IES Grant #R305D150040).
NSP is partially supported by an ONR grant. We would like to thank Isaiah Andrews, Peter Bickel, Alex D'Amour, Peng Ding, Fabrizia Mealli, Christian Robert, Don Rubin, Dylan Small, Weixin Yao, and members of the Spencer group for helpful comments and discussion, as well as seminar participants at the 2015 Atlantic Causal Inference Conference and at the 2015 Joint Statistical Meetings. All opinions expressed in the paper and any errors that it might contain are solely the responsibility of the authors.

1 Introduction

Principal stratification is a widely used framework for addressing post-randomization complications in a principled way (Frangakis and Rubin, 2002). Typically, the goal is to estimate a causal effect within a partially latent subgroup known as a principal stratum. While there are many possible ways to estimate these principal causal effects, by far the most common approach is via finite mixture models, treating the unknown principal strata as mixture components (Imbens and Rubin, 1997). To date, scores of applied and methodological papers have relied on finite mixtures to estimate causal effects, both explicitly and implicitly.

At the same time, it has long been conventional wisdom that finite mixture models can yield pathological estimates (Redner and Walker, 1984). As Larry Wasserman remarked, finite mixture models are the "Twilight Zone of Statistics" (Wasserman, 2012). Our motivation for this paper is to understand how the pathological features of finite mixture models affect inference for component-specific means, which is a central challenge in principal stratification and is also of general scientific interest. Despite the wealth of literature on finite mixture models, there is relatively little research on the behavior of standard estimators like the Maximum Likelihood Estimate (MLE) in settings typical of principal stratification, namely when substantial separation between mixture components is unlikely.

We take a first step by carefully studying a simple but fundamental model: a two-component homoskedastic Gaussian mixture model,

\[ Y_i \overset{\text{iid}}{\sim} \pi\,\mathcal{N}(\mu_0, \sigma^2) + (1 - \pi)\,\mathcal{N}(\mu_1, \sigma^2), \tag{1.1} \]

in which only the component means, $\{\mu_0, \mu_1\}$, are unknown. That is, we assume that the mixing proportion, $\pi$, is known and that $0 < \pi < 1/2$. [1] We also assume that the within-component variance, $\sigma^2$, is known and equal between components (we later relax these restrictions). As we show, identification for $\mu_0$ and $\mu_1$ is immediate and unambiguous.
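To make the model in Equation 1.1 concrete, the following is a minimal sketch of how one might draw data from it, assuming NumPy. The function name `sample_mixture` and the parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_mixture(n, pi, mu0, mu1, sigma=1.0, seed=None):
    """Draw n observations from pi*N(mu0, sigma^2) + (1 - pi)*N(mu1, sigma^2)."""
    rng = np.random.default_rng(seed)
    # Latent component labels: True means the observation comes from component 0.
    from_comp0 = rng.random(n) < pi
    means = np.where(from_comp0, mu0, mu1)
    return rng.normal(loc=means, scale=sigma)

# Illustrative values (assumed): pi = 0.3, mu0 = 0.5, mu1 = 0, sigma = 1.
y = sample_mixture(n=100_000, pi=0.3, mu0=0.5, mu1=0.0, seed=0)
# The mixture mean is pi*mu0 + (1 - pi)*mu1 and the mixture variance is
# sigma^2 + pi*(1 - pi)*(mu0 - mu1)^2; the sample versions should be close.
print(y.mean(), y.var(ddof=1))
```

With a separation of half a standard deviation, the mixture variance exceeds the within-component variance by only about 5%, which gives a first sense of how little the observed data reveal about the component means.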
Moreover, when the number of components is known (i.e., $\mu_0 \neq \mu_1$), the usual regularity conditions apply (Everitt and Hand, 1981). In other words, the MLE should be well-behaved in this simple example. [2] Unfortunately, this does not bear out in practice: even when the model is correctly specified and when $\mu_0 \neq \mu_1$, the MLE behaves poorly in many realistic settings (Day, 1969; Hosmer Jr, 1973; Redner and Walker, 1984).

[1] Similarly to Tan and Chang (1972), we assume that $\pi < 1/2$ so that the mixture is identified and to avoid degeneracy of the third cumulant of the mixture distribution when $\pi = 1/2$. Apart from the restriction that $\pi \neq 1/2$, this is completely general, since we can simply switch the component labels.

[2] Note that this is distinct from understanding the convergence of EM and related algorithms, as we are considering the performance of the MLE rather than our ability to find it. Balakrishnan et al. (2014) give a recent discussion.

1.1 Our contribution

In this paper, we (1) document two pathologies for the MLE of component-specific means that arise when components are not well separated; (2) use moment equations to build intuition for these pathologies and to characterize their behavior as a function of mixture parameters; and (3) propose a method to construct confidence sets via test inversion. In general, we present results for the standard mixture case and then explore additional complications that arise in the context of principal stratification models.

First, the main inferential challenge for the mixture model in Equation 1.1 is estimating the difference in component means, $\Delta = \mu_0 - \mu_1$. There are two key issues. The first issue is known as pile up, which occurs when the likelihood surface for $\Delta$ is unimodal and centered at zero, despite the fact that $\Delta \neq 0$. Even with

a bimodal likelihood, the same underlying pathology that causes pile up can also lead to severe bias in the MLE. The second issue is the classic problem of choosing the global mode in a bimodal likelihood (McLachlan and Peel, 2004). This problem is particularly pernicious in the settings we explore here. In our simulations, the usual heuristic of choosing the mode with the higher likelihood is typically no better than a coin flip, often yielding the opposite sign from the truth, $\operatorname{sgn}(\hat{\Delta}_{\text{mle}}) \neq \operatorname{sgn}(\Delta)$.

From a theoretical perspective, complications arise when $\Delta$ is non-zero but still close enough to zero (zero is the point at which the regularity conditions no longer hold). Hence, $\Delta$ is still point identified but identification is weak. There are many examples of weak identification in other settings, including the weak instruments problem (Staiger and Stock, 1997) and the moving average unit root problem, which is the source of the term pile up (Shephard and Harvey, 1990; Andrews and Cheng, 2012). Chen et al. (2014) previously explored issues of weak identification in finite mixtures, though with a very different goal. In the spirit of this broad literature, we also show that standard asymptotic results for finite mixtures can break down in certain cases.

To understand the behavior of the MLE in finite samples, we must investigate the likelihood equations directly. As these are analytically intractable, we develop intuition by deriving a simple method of moments estimator that captures key features of the MLE. We connect the pathologies that we observe to the difficulty of estimating non-linear functions of the sample variance and skewness. We then derive analytic formulas for the probability that each pathology will occur in a given setting and conduct simulations to show that the MLE and moment estimator display these behaviors at similar rates. Overall, these diagnostic formulas are analogous to design or power calculations, giving a sense of the behavior of estimates prior to actually running the analysis.
There are many possible solutions to counter the pathologies outlined above. Following common practice in other settings with weak identification, we suggest one based on test inversion, namely generating simulation-based p-values for a grid of values of $\Delta$ and then inverting these tests to form confidence sets. This approach is known as a grid bootstrap in time series models (Andrews, 1993; Hansen, 1999; Mikusheva, 2007) and is closely related to the parametric bootstrap for the Likelihood Ratio Test statistic in finite mixture models (McLachlan, 1987; Chen et al., 2014). When the model is correctly specified, this approach yields exact p-values up to Monte Carlo error. This approach works well in the settings we explore here.

Finally, we discuss the case when the within-component variance, $\sigma^2$, is unknown, both for the equal- and unequal-variance case. We show that the performance of the MLE is even worse in this setting, with substantial bias for realistic values of $\Delta$.

Although the technical discussion focuses narrowly on finite mixtures, our motivation remains the broader question of inference for causal effects within principal strata. To date, only a handful of papers have directly addressed the finite sample properties of mixtures for causal inference. Griffin et al. (2008) conduct extensive simulations and conclude that principal stratification models are generally impractical in social science settings. Mattei et al. (2013) caution that univariate mixture models often yield poor results and suggest jointly estimating effects for multiple outcomes, such as by assuming multivariate Normality. Mercatanti (2013) proposes an approach for inference with a multimodal likelihood in the principal stratification setting. Frumento et al. (2016) explore methods for quantifying uncertainty in principal stratification problems when the likelihood is non-ellipsoidal. See also Zhang et al. (2008), Richardson et al. (2011), and Frumento et al. (2012). Following Mattei et al.
(2013), we illustrate the key concepts through the evaluation of JOBS II,

a job training program that has served as a popular example in the causal inference literature (see, for example, Vinokur et al., 1995; Jo and Stuart, 2009).

1.2 Related literature on finite mixtures

There is a vast literature on inference in finite mixture models, dating back to the seminal work of Pearson (1894). Everitt and Hand (1981), Redner and Walker (1984), Titterington et al. (1985), and McLachlan and Peel (2004) give thorough reviews. Frühwirth-Schnatter (2006) focuses on the Bayesian paradigm; Lindsay (1995) gives an overview of moment estimators; and Moitra (2014) discusses the research from machine learning. We briefly highlight several relevant aspects of this literature.

First, many early researchers used simulation to study the finite sample performance of moment and maximum likelihood estimators for finite mixtures (Day, 1969; Tan and Chang, 1972; Hosmer Jr, 1973). Redner and Walker (1984) give an extensive review. Interestingly, we are aware of very few simulation studies on the performance of these estimators published since Redner and Walker (1984), despite dramatic changes in computational power in the intervening thirty years. Frumento et al. (2016) offer an important recent exception in the context of principal stratification models. Consistent with our results, the authors find poor coverage for standard confidence intervals computed via the MLE plus information- or bootstrap-based standard errors. See also Chung et al. (2004).

Second, there has been extensive research on the asymptotic behavior of finite mixture models. The standard result is that, with a known number of components and under general regularity conditions, mixture parameters have $\sqrt{n}$-convergence (Redner and Walker, 1984; Chen, 1995). Chen (1995), however, shows that parametric convergence can break down in certain cases (see also Heinrich and Kahn, 2015).
Similar issues arise in the asymptotic distribution of the Likelihood Ratio Test (LRT) statistic for testing the number of components in finite mixtures (McLachlan and Peel, 2004). McLachlan (1987) proposes a parametric bootstrap to resolve these issues in settings including the two-component Gaussian mixture model. Chen et al. (2014) propose a similar method when the parameters are near but not at a singularity in the parameter space. In their setting, however, the goal is to estimate the mixing proportions, $\pi$, although the broad argument is quite similar to what we discuss here. The asymptotic behavior of the LRT with known mixing proportion, $\pi$, is addressed in Quinn et al. (1987), Goffinet et al. (1992) and Polymenis and Titterington (1999). See also Aitkin and Rubin (1985).

Finally, the sign error issue we address is exactly the well-studied question of choosing the correct mode in a multi-modal likelihood. McLachlan and Peel (2004) and Blatt and Hero (2007) give reviews. In virtually all cases, the standard recommendation is to choose the mode with the highest value of the likelihood. Some exceptions include Gan and Jiang (1999) and Biernacki (2005), who introduce tests that leverage different methods for computing the score function; Mercatanti (2013), who suggests a method that incorporates moment estimates; and Frumento et al. (2012), who propose a scaled log-likelihood ratio statistic to compare different modes.

1.3 Bayesian inference for finite mixtures

Finite mixture models are an important topic in Bayesian statistics, in part because mixtures fit naturally into the Bayesian paradigm (Frühwirth-Schnatter, 2006). This approach offers some distinct advantages

over likelihood-based inference. [3] For example, the Bayesian can incorporate informative prior information, which can be especially important in finite mixture modeling; see, for example, Aitkin and Rubin (1985); Hirano et al. (2000); Chung et al. (2004); Lee et al. (2009); Gelman (2010). Moreover, our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior.

More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian. Imbens and Rubin (1997) and Mattei et al. (2013), for example, characterize weak identification as substantial regions of flatness in the posterior, which increases uncertainty but does not lead to any fundamental challenges. [4] Nonetheless, we argue that our results are highly relevant for Bayesians who are also interested in good frequency properties (Rubin, 1984). In the supplementary materials, we offer evidence that the pathological behaviors we document for the MLE also hold for the posterior mean and median with some default prior values. In this sense, we conduct a Frequentist evaluation of a Bayesian procedure (e.g., Rubin, 2004) and find poor frequency properties overall.

More generally, we agree that informative prior information can be a powerful tool for improving inference in this setting. Our results provide a framework for assessing how strong that prior information must be to avoid the pitfalls that we document, for example, by extending the grid bootstrap to incorporate a prior.

1.4 Paper plan

The paper proceeds as follows. In Section 2, we give an overview of finite mixture models and their applications to causal inference. In Section 3, we describe finite sample properties of the MLE and give results when the separation between $\mu_1$ and $\mu_0$ goes to zero asymptotically. In Section 4, we derive the method of moments estimator, characterize the pathologies, and connect these results to the MLE.
In Section 5, we propose methods for obtaining confidence intervals for the component means via test inversion. In Section 6, we apply these methods to the JOBS II example. In Section 7, we relax the assumption of known variance. We conclude with some thoughts on possible next steps.

Finally, the supplementary materials address several points that go beyond the main text. First, we give additional discussion on applying these results to principal stratification. In particular, we discuss the role of covariates, which are widely used as part of principal stratification models (e.g., Jo, 2002; Zhang et al., 2009). In general, we find that simply adding covariates without also incorporating additional restrictions (such as conditional independence assumptions) can increase the probability of obtaining a pathological result. Second, we address the performance of both the case-resampling bootstrap and Bayesian methods in this setting, finding poor performance overall. Finally, we give a much more detailed proof of the main theorem.

[3] The Bayesian approach also introduces some unique challenges that we do not address here, namely the label-switching problem (Celeux et al., 2000; Jasra et al., 2005) and the difficulty of specifying vague prior distributions for finite mixtures (Grazian and Robert, 2015).

[4] Imbens and Rubin (1997) note that "issues of identification [in the Bayesian perspective] are quite different from those in the frequentist perspective because with proper prior distributions, posterior distributions are always proper. The effect of adding or dropping assumptions is directly addressed in the phenomenological Bayesian approach by examining how the posterior predictive distributions for causal estimands change."

2 Finite mixtures in causal inference

2.1 Setup

To illustrate the role of finite mixtures in causal inference, we begin with the canonical example of a randomized experiment with noncompliance. We set up the problem using the potential outcomes framework of Neyman (1923) and Rubin (1974). We observe $N$ individuals who are randomly assigned to a treatment group, $T_i = 1$, or control group, $T_i = 0$. As usual, we assume that randomization is valid and that the Stable Unit Treatment Value Assumption holds (SUTVA; Rubin, 1980; Imbens and Rubin, 2015). This allows us to define potential outcomes for individual $i$, $Y_i(0)$ and $Y_i(1)$, under control and treatment respectively, with observed outcome, $Y_i^{\text{obs}} = T_i Y_i(1) + (1 - T_i) Y_i(0)$. The fundamental problem of causal inference is that we observe only one potential outcome for each unit. We define the Intent-to-Treat (ITT) effect as the impact of randomization on the outcome, $\text{ITT} = E[Y_i(1) - Y_i(0)]$. Throughout, we take expectations and probabilities to be over a hypothetical super-population.

In practice, there is often noncompliance with treatment assignment (Angrist et al., 1996). Let $D_i$ be an indicator for whether individual $i$ receives the treatment, with corresponding compliance $D_i(0)$ and $D_i(1)$ for control and treatment respectively. For simplicity, we assume that only individuals assigned to treatment can receive the active intervention (i.e., there is one-sided noncompliance), which is the case in the JOBS II evaluation. Formally, $D_i(0) = 0$ for all $i$. This gives two subgroups of interest: Never Takers, $D_i(1) = 0$, and Compliers, $D_i(1) = 1$. Following Angrist et al. (1996) and Frangakis and Rubin (2002), we refer to these subgroups interchangeably as compliance types or principal strata, $U_i \in \{c, n\}$, with $c$ denoting Compliers and $n$ denoting Never Takers.
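The setup above can be illustrated with a small simulation of a randomized experiment with one-sided noncompliance. This is a sketch under assumptions made here for illustration: the function name `simulate_one_sided`, independent coin-flip assignment with probability 1/2, Normal potential outcomes with a common standard deviation, and all numeric values are hypothetical rather than taken from the paper.

```python
import numpy as np

def simulate_one_sided(n, pi_c, mu, sigma=1.0, seed=None):
    """Simulate a randomized experiment with one-sided noncompliance.

    mu maps (stratum, arm) to an outcome mean, e.g. mu[("c", 1)].
    """
    rng = np.random.default_rng(seed)
    is_complier = rng.random(n) < pi_c            # principal stratum U_i
    t = rng.integers(0, 2, size=n)                # randomized assignment T_i
    # Potential outcomes under control and treatment for every unit.
    y0 = rng.normal(np.where(is_complier, mu[("c", 0)], mu[("n", 0)]), sigma)
    y1 = rng.normal(np.where(is_complier, mu[("c", 1)], mu[("n", 1)]), sigma)
    # One-sided noncompliance: D_i(0) = 0, and D_i(1) = 1 only for Compliers.
    d = t * is_complier.astype(int)
    # Observed outcome: Y_obs = T * Y(1) + (1 - T) * Y(0).
    y_obs = t * y1 + (1 - t) * y0
    return t, d, y_obs, is_complier

mu = {("c", 0): 0.5, ("c", 1): 0.0, ("n", 0): 0.0, ("n", 1): 0.0}
t, d, y_obs, is_complier = simulate_one_sided(20_000, pi_c=0.55, mu=mu, seed=0)
# Nobody in the control group receives treatment, and the share of treated
# units taking up treatment estimates the Complier proportion.
print(d[t == 0].sum(), d[t == 1].mean())
```

In the treatment arm the indicator `d` reveals each unit's stratum, whereas in the control arm `d` is always zero, so stratum membership there is latent.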
The estimands of interest are the ITT effects for Compliers and Never Takers:

\[ \text{ITT}_c = E[Y_i(1) - Y_i(0) \mid U_i = c] = \mu_{c1} - \mu_{c0}, \]
\[ \text{ITT}_n = E[Y_i(1) - Y_i(0) \mid U_i = n] = \mu_{n1} - \mu_{n0}, \]

in which $\mu_{c1}$, $\mu_{c0}$, $\mu_{n1}$, and $\mu_{n0}$ represent the outcome means for Compliers assigned to treatment, Compliers assigned to control, Never Takers assigned to treatment, and Never Takers assigned to control, respectively.

In the case of one-sided noncompliance, we observe stratum membership for individuals assigned to treatment. Therefore, we can immediately estimate $\mu_{c1}$ and $\mu_{n1}$. Moreover, due to randomization, the observed proportion of Compliers in the treatment group is, in expectation, equal to the overall proportion of Compliers in the population, $\pi = P\{U_i = c\}$. Thus, we treat $\pi$ as essentially known or, at least, directly estimable.

The main inferential challenge is that we do not observe stratum membership in the control group. Rather, we observe a mixture of Compliers and Never Takers assigned to control:

\[ Y_i^{\text{obs}} \mid Z_i = 0 \;\sim\; \pi f_{c0}(y_i) + (1 - \pi) f_{n0}(y_i), \tag{2.1} \]

where $f_{u0}(y)$ is the distribution of potential outcomes for individuals in stratum $u$ assigned to control. The standard solution for this problem is to invoke the exclusion restriction for Never Takers, which

states that $\text{ITT}_n = 0$, or equivalently, $\mu_{n1} = \mu_{n0}$. With this assumption, we can then estimate $\text{ITT}_c$ with the usual instrumental variables approach (Angrist et al., 1996). Without this assumption, however, we need to impose other structure on the problem to achieve point identification (see, e.g., Zhang and Rubin, 2003).

2.2 Model-based principal stratification

One increasingly common option is to invoke parametric assumptions for the $f_{u0}(y)$. In a seminal paper, Imbens and Rubin (1997) outlined a model-based framework for instrumental variable models, proposing a parametric model for the outcome distribution conditional on stratum membership and treatment assignment, such as $f_{uz}(y_i) = \mathcal{N}(\mu_{uz}, \sigma^2_{uz})$, with $u$ representing stratum membership and $z$ representing treatment assignment. While the exclusion restriction can strengthen inference in this setting, it is not strictly necessary. Instead, identification is based entirely on standard results for mixture models. Since Imbens and Rubin (1997), dozens of papers have used finite mixtures for estimating causal effects. [5]

For simplicity, we focus on applications of principal stratification in which the researcher assumes that certain principal strata do not exist (i.e., monotonicity) and it is possible to directly estimate the distribution of principal strata. For example, in the case of two-sided noncompliance, the "no Defiers" assumption makes this possible (Angrist et al., 1996). While the more general case is important, inference is also more challenging. See Page et al. (2015) for a brief discussion of these assumptions.

For our example of one-sided noncompliance, we can write the observed data likelihood with Normal component distributions as:

\[ L_{\text{obs}}(\theta) = \prod_{i:\, Z_i = 1,\, D_i^{\text{obs}} = 1} \pi\, \mathcal{N}(y_i \mid \mu_{c1}, \sigma^2_{c1}) \times \prod_{i:\, Z_i = 1,\, D_i^{\text{obs}} = 0} (1 - \pi)\, \mathcal{N}(y_i \mid \mu_{n1}, \sigma^2_{n1}) \times \prod_{i:\, Z_i = 0} \left[ \pi\, \mathcal{N}(y_i \mid \mu_{c0}, \sigma^2_{c0}) + (1 - \pi)\, \mathcal{N}(y_i \mid \mu_{n0}, \sigma^2_{n0}) \right], \]

where $\theta$ represents the vector of parameters and $\mathcal{N}(y_i \mid \mu, \sigma^2)$ is the Normal density with mean $\mu$ and variance $\sigma^2$.
The observed data likelihood for individuals assigned to treatment ($Z_i = 1$) immediately factors into the likelihood for the Compliers and the likelihood for the Never Takers. We can directly estimate the component parameters under treatment, $\mu_{c1}$, $\mu_{n1}$, $\sigma^2_{c1}$, and $\sigma^2_{n1}$. We can also directly estimate $\pi$ among individuals assigned to treatment. Therefore, we are essentially left with a two-component Normal mixture model with known $\pi$ among those individuals assigned to control. [6] Note that the mixture portion of this likelihood is in general unbounded due to the possibility of either $\sigma^2_{u1}$ or $\sigma^2_{u0}$ being close to 0. However, the whole likelihood is bounded with probability one due to the effect of the non-mixture terms. [7]

[5] Some examples of other relevant papers are Little and Yau (1998); Hirano et al. (2000); Barnard et al. (2003); Ten Have et al. (2004); Gallo et al. (2009); Zhang et al. (2009); Elliott et al. (2010); Zigler and Belin (2011); Frumento et al. (2012); Page (2012); Schochet (2013).

[6] Note that there is a very small amount of information about $\pi$ from the mixture model among those assigned to the control group. Given the other complications that arise in mixture modeling, we ignore this and treat $\pi$ as being estimated directly from the treatment group.

[7] We are assuming that at least one observation is made in each compliance stratum of the treatment group.
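The structure of the observed-data likelihood in Section 2.2 (two directly observed strata under treatment, a two-component mixture under control) can be sketched in code. The function names and the dictionary keys used for the parameter vector are hypothetical conveniences introduced here, and the sketch parameterizes each component by its standard deviation rather than its variance.

```python
import numpy as np

def norm_logpdf(y, mu, sd):
    """Log density of a Normal with mean mu and standard deviation sd."""
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((y - mu) / sd) ** 2

def obs_loglik(theta, y, z, d):
    """Observed-data log-likelihood under one-sided noncompliance.

    theta: dict with key "pi" plus "mu_*" and "sd_*" for c1, n1, c0, n0
    (hypothetical key names chosen for this sketch).
    """
    pi = theta["pi"]
    treated_c = (z == 1) & (d == 1)   # Compliers, stratum observed
    treated_n = (z == 1) & (d == 0)   # Never Takers, stratum observed
    control = z == 0                  # stratum membership is latent
    ll = np.sum(np.log(pi) + norm_logpdf(y[treated_c], theta["mu_c1"], theta["sd_c1"]))
    ll += np.sum(np.log1p(-pi) + norm_logpdf(y[treated_n], theta["mu_n1"], theta["sd_n1"]))
    # Control arm: log of a two-component mixture, computed on the log scale.
    ll += np.sum(np.logaddexp(
        np.log(pi) + norm_logpdf(y[control], theta["mu_c0"], theta["sd_c0"]),
        np.log1p(-pi) + norm_logpdf(y[control], theta["mu_n0"], theta["sd_n0"]),
    ))
    return float(ll)

# Tiny usage example with made-up data: two treated units and one control unit.
y = np.array([0.0, 1.0, 2.0])
z = np.array([1, 1, 0])
d = np.array([1, 0, 0])
theta = {"pi": 0.55, "mu_c1": 0.0, "sd_c1": 1.0, "mu_n1": 1.0, "sd_n1": 1.0,
         "mu_c0": 2.0, "sd_c0": 1.0, "mu_n0": 0.0, "sd_n0": 1.0}
print(obs_loglik(theta, y, z, d))
```

Only the last term couples the control-arm means, which is why the inferential difficulty concentrates in the mixture over control units.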

To estimate $\text{ITT}_c$ and $\text{ITT}_n$, we need to estimate $\mu_{c0}$ and $\mu_{n0}$. This means that the component-specific variances, $\sigma^2_{c0}$ and $\sigma^2_{n0}$, are nuisance parameters. We initially assume that the component-specific variances are constant across components: $\sigma^2 = \sigma^2_{c1} = \sigma^2_{c0} = \sigma^2_{n1} = \sigma^2_{n0}$ (see, for example, Gallo et al., 2009). Since we can directly estimate $\sigma^2_{c1}$ and $\sigma^2_{n1}$, this means that we no longer need to estimate the component-specific variances in the finite mixture model and can treat these parameters as known. [8] We relax this assumption in Section 7.

[8] As with the mixing proportion, there is a very small amount of information about the overall $\sigma^2$ in the finite mixture model. Again, we ignore this complication.

2.3 JOBS II

For a running example, we use the Job Search Intervention Study (JOBS II), a randomized field experiment of a mental health and job training intervention among unemployed workers (Vinokur et al., 1995) that has been extensively studied in the causal inference literature (Jo and Stuart, 2009; Mattei et al., 2013). We focus on a subset of 410 high risk individuals, each randomly assigned to treatment or to control. For illustration, we estimate the impact of the program on (log) depression score six months after randomization.

An important complication in this study is that only 55% of those individuals assigned to treatment actually enrolled in the program. Therefore, we are not only interested in the overall ITT, but also in the ITT for Compliers and the ITT for Never Takers. Note that we do not invoke the exclusion restriction for Never Takers; in other words, we want to estimate $\text{ITT}_n$ rather than assume that $\text{ITT}_n = 0$. To do so, we follow Mattei et al. (2013) and assume the outcome distribution is Normal for each $U$ and $Z$ combination.

As an intermediate step, we are interested in $\Delta = \mu_{c0} - \mu_{n0}$ for this model. Figure 1 shows the distribution of the MLE of $\Delta$ for 1000 fake data sets generated from a two-component homoskedastic Normal mixture model with $N = 132$, $\pi = 0.55$, and $\sigma = 1$ (so that all estimates are in effect size units). [9] In Figures 1a and 1b, the assumed values of $\Delta$ are 0.5 and 1.0 standard deviations, respectively, which are quite large differences in the context of JOBS II. Clearly the sampling distribution of the MLE is markedly non-Normal, showing strong bimodality in addition to a large spike around zero.

[9] Note that this does not fit into the parameterization of Equation 1.1, but that all the same results hold for $\pi > 1/2$ and $\Delta = \mu_1 - \mu_0$. We switch the parameterization to match the JOBS II data set and Equation 2.1. Also, the only unknowns in the likelihood are the component means; all other parameters are assumed known and fixed at the correct values. Finally, we calculate the MLE directly, rather than via EM (Dempster et al., 1977).

3 Likelihood inference in finite mixture models

3.1 Asymptotic results

We begin by documenting that standard asymptotic results can break down when components are not well separated. In particular, standard results state that, with a known number of components and under general regularity conditions, mixture parameters have $\sqrt{n}$-convergence (Redner and Walker, 1984; Chen, 1995). Based on examples like those in Figure 1, these results do not necessarily kick in for settings where the difference in component means is small relative to the sample size. To explain this disconnect, we consider the case in which $Y_i$ are distributed as in Equation 1.1 but the values of the component means $\mu_1$ and $\mu_0$ vary with $n$, becoming $\mu_{1,n}$ and $\mu_{0,n}$. Correspondingly, the separation of means becomes

\[ \Delta_n = \mu_{0,n} - \mu_{1,n}. \tag{3.1} \]

We prove that if $\Delta_n = o(n^{-1/4})$ in a class of models similar to those considered by Chen (1995), then $\hat{\Delta}_n - \Delta_n = O_p(n^{-1/4})$ and $\hat{\Delta}_n - \Delta_n \neq o_p(n^{-1/4})$, with $\hat{\Delta}_n$ denoting the maximum-likelihood estimator of $\Delta_n$. That is, if the separation of $\mu_{1,n}$ and $\mu_{0,n}$ disappears quickly enough, the maximum-likelihood estimator's parametric rate of convergence is sacrificed.

Figure 1: Distribution of the MLE for $\Delta$ for 1000 fake data sets generated from a two-component homoskedastic Normal mixture model with $N = 132$, $\pi = 0.55$, and $\sigma = 1$. Panel (a): assumed difference in means is 0.5 SD. Panel (b): assumed difference in means is 1.0 SD. Horizontal axis: difference in stratum means (std. dev.); vertical axis: frequency.

Theorem 3.1. Let $Y_{i,n}$ for $i \in \{1, \ldots, n\}$ and $n \in \mathbb{N}$ be drawn independently from the model

\[ Y_{i,n} \sim \pi\, \mathcal{N}\!\left(\Delta_n, \sigma^2\right) + (1 - \pi)\, \mathcal{N}\!\left(-\frac{\pi}{1 - \pi}\Delta_n, \sigma^2\right), \tag{3.2} \]

with $\Delta_n = o(n^{-1/4})$. Then $\hat{\Delta}_n - \Delta_n = O_p(n^{-1/4})$ and $\hat{\Delta}_n - \Delta_n \neq o_p(n^{-1/4})$.

The proof of this result, which follows Chen (1995), is given in the Appendix. The key theoretical aspect of the above model is that the Fisher Information of $\Delta_n$ in each $y_{i,n}$, $I(\Delta_n)$, is 0 when $\Delta_n = 0$. Thus, the Fisher Information of $\Delta_n$ contained in the entire sample $\{y_{1,n}, \ldots, y_{n,n}\}$ is $n I(\Delta_n)$, and there are two competing limits: $n \to \infty$ and $I(\Delta_n) \to 0$.

3.2 Pathologies of the MLE

The above result shows that the usual parametric asymptotic rate of convergence of the MLE can break down in a certain class of models. Recognizing that we should not necessarily expect parametric convergence, we now turn to characterizing the finite sample behavior of the MLE. The three-part sampling distribution in Figure 1 is a good illustration of the key inferential issues we face. We refer to the large point mass at zero, such as in Figure 1a, as pile up. For these simulated data sets, $\hat{\Delta}_{\text{mle}} \approx 0$ even though $\Delta \neq 0$. This generally occurs when the likelihood surface is unimodal, as shown in Figure 2a. We refer to those MLEs with an

opposite sign from the truth as a sign error, in which $\operatorname{sgn}(\hat{\Delta}_{\text{mle}}) \neq \operatorname{sgn}(\Delta)$. For this problem, the mixture likelihood surface is generally bimodal (see Figure 2b). A sign error occurs when the global MLE, obtained by choosing the mode with the higher likelihood (McLachlan and Peel, 2004), chooses the wrong mode.

Figure 2: Examples of unimodal and bimodal log-likelihoods generated from the two-component Gaussian mixture. Panel (a): example unimodal log-likelihood. Panel (b): example log-likelihood for $\Delta = 1.0$. Each panel plots the log-likelihood against the component means.

These pathologies are present across extensive simulations. Specifically, we set parameters to $\pi \in \{0.2, 0.325, 0.45\}$, $N \in \{50, 100, 200, 400, 500, 1000, 2000, 5000\}$, and $\Delta \in \{0.25, 0.5, 0.75, 1\}$. For each set of parameters, we simulate 1000 fake data sets from the Gaussian mixture model in Equation 1.1. For each data set, we then calculate the MLE, compute standard errors via the Hessian of the log-likelihood evaluated at the MLE (i.e., the observed Fisher information), and generate nominal confidence intervals via the point estimate $\pm$ 1.96 standard errors. We then calculate the average bias and coverage rates at each setting of the parameter values, as well as the empirical probabilities of pile up and sign error.

Figure 3 shows the empirical bias and coverage for $\hat{\Delta}_{\text{mle}}$. The left column shows simulation results when the true $\Delta = 0.25$ and the right column shows results for a second, larger value of $\Delta$. In both cases, $\pi$ is held fixed and $\sigma^2 = 1$. The first row shows the average estimate for $\hat{\Delta}_{\text{mle}}$ across simulations, which is essentially 0 when $\Delta = 0.25$ and slowly climbs from 0 in the right column. The second row shows the empirical coverage of the 95% confidence intervals for $\Delta$. Strikingly, this coverage deteriorates as $N$ increases, although it eventually begins to climb again. The third row shows the empirical probability of each pathology (where "correct" corresponds to an MLE that is neither pile up nor the wrong sign).
For $\Delta = 0.25$, the probability of pile up is shockingly high, and pile up is the likeliest outcome for sample sizes less than 200. Even with sample sizes as high as 5000, the estimated sign is as likely to be negative as positive. These pathologies help to explain the seemingly strange pattern with bias and coverage. For $\Delta = 0.25$, for example, coverage gets worse because confidence intervals get tighter around a point estimate that has the wrong sign.
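A small-scale version of this kind of simulation can be sketched as follows. The sketch rests on several assumptions that differ from the paper: it approximates the MLE by running EM from two opposite starting points and keeping the fit with the higher likelihood (the paper computes the MLE directly), it flags pile up with an arbitrary tolerance `tol` on the estimated difference, and it uses far fewer replications. Because EM converges slowly when components overlap, the classification near zero is only approximate. All function names are hypothetical.

```python
import numpy as np

def mixture_loglik(y, mu0, mu1, pi, sigma):
    """Log-likelihood of pi*N(mu0, s^2) + (1 - pi)*N(mu1, s^2), up to a constant."""
    a = np.log(pi) - 0.5 * ((y - mu0) / sigma) ** 2
    b = np.log1p(-pi) - 0.5 * ((y - mu1) / sigma) ** 2
    return np.sum(np.logaddexp(a, b))

def em_means(y, mu0, mu1, pi, sigma, n_iter=200):
    """EM updates for the two means only; pi and sigma stay fixed."""
    for _ in range(n_iter):
        a = np.log(pi) - 0.5 * ((y - mu0) / sigma) ** 2
        b = np.log1p(-pi) - 0.5 * ((y - mu1) / sigma) ** 2
        r = np.exp(a - np.logaddexp(a, b))        # P(component 0 | y_i)
        mu0 = np.sum(r * y) / np.sum(r)
        mu1 = np.sum((1 - r) * y) / np.sum(1 - r)
    return mu0, mu1

def delta_mle(y, pi, sigma):
    """Approximate MLE of delta = mu0 - mu1: best of two opposite EM starts."""
    m, s = y.mean(), y.std()
    fits = [em_means(y, m + s, m - s, pi, sigma),
            em_means(y, m - s, m + s, pi, sigma)]
    best = max(fits, key=lambda f: mixture_loglik(y, f[0], f[1], pi, sigma))
    return best[0] - best[1]

def pathology_rates(n, delta, pi=0.325, sigma=1.0, reps=100, tol=0.05, seed=0):
    """Share of replications classified as correct, pile up, or wrong sign."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for k in range(reps):
        y = rng.normal(np.where(rng.random(n) < pi, delta, 0.0), sigma)
        est[k] = delta_mle(y, pi, sigma)
    pile_up = np.abs(est) < tol                   # tolerance is an assumption
    wrong = ~pile_up & (np.sign(est) != np.sign(delta))
    return {"correct": float(np.mean(~pile_up & ~wrong)),
            "pile_up": float(np.mean(pile_up)),
            "wrong_sign": float(np.mean(wrong))}

weak = pathology_rates(n=100, delta=0.25)     # weakly separated components
strong = pathology_rates(n=500, delta=3.0)    # well separated components
print(weak)
print(strong)
```

Comparing the two printed summaries gives a rough feel for how the share of correctly signed, non-degenerate estimates depends on the separation between components.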

Finally, while we focus on the behavior of the MLE itself, many factors contribute to poor coverage. Following Rubin and Thayer (1983) and Frumento et al. (2016), we note that in many settings the log-likelihood at the MLE is not well-approximated by a quadratic, which partly explains the simulation results (see also McLachlan and Peel, 2004). In the supplementary materials, we also show that bootstrap-based standard errors do not resolve these issues.

Figure 3: Simulation results for the MLE of $\Delta$. The dotted lines represent the true values. Rows: simulation mean for $\hat{\Delta}$; % coverage for $\Delta$; probability of each outcome (Correct, Pile Up, Wrong Sign).

4 Method of Moments

4.1 Point estimation and confidence intervals

To understand this poor behavior for the MLE, we would ideally investigate the likelihood equations directly. Unfortunately, the likelihood equations are intractable and the mixture likelihood surface itself is notoriously complicated (Lindsay, 1983). We therefore develop a simple method of moments estimate, $\hat{\Delta}_{\text{mom}}$, that captures the essential features of the MLE in the settings we consider. [10]

[10] Note that moment estimators for finite mixture models can be quite complex; the goal here is to build intuition rather than to improve on existing moment estimators. See Furman and Lindsay (1994a,b). Titterington (2004) offers a review.

Our key point, which has been made repeatedly before (e.g., Kiefer, 1978), is that $\hat{\Delta}_{\text{mle}}$ and $\hat{\Delta}_{\text{mom}}$ essentially use the same sample information in practice. While the MLE incorporates an infinite number of moments and is an efficient estimator (Redner

and Walker, 1984), in practice there is negligible information in the higher order moments, at least beyond the third or fourth moments. As a result, there is little reason to expect the MLE to fare better than the MOM estimate in settings where the latter breaks down. Similarly, by characterizing the MOM estimate, we can obtain good heuristic approximations of the MLE's behavior.

To define $\hat\Delta_{mom}$, we first let $\kappa_k$ be the $k$th cumulant of the observed mixture distribution (Tan and Chang, 1972). Then the first two mixture cumulants, the mean and variance, are:

$$\kappa_1 = \pi\mu_0 + (1-\pi)\mu_1,$$
$$\kappa_2 = \sigma^2 + \pi(1-\pi)(\mu_1-\mu_0)^2,$$

where $\sigma^2$ is the within-component variance and $\pi(1-\pi)(\mu_1-\mu_0)^2$ is the between-component variance. Substituting in $\Delta = \mu_0 - \mu_1$ yields

$$\kappa_1 = \mu_1 + \pi\Delta, \tag{4.1}$$
$$\kappa_2 = \sigma^2 + \pi(1-\pi)\Delta^2. \tag{4.2}$$

Since $\pi$ and $\sigma^2$ are known, we can immediately estimate $\Delta^2$ using the observed mixture variance:

$$\widehat{\Delta^2} = \frac{\hat\kappa_2 - \sigma^2}{\pi(1-\pi)}.$$

In words, $\widehat{\Delta^2}$ is the scaled difference between the total mixture variance and the within-component variance. The estimate of $\kappa_2$, $\hat\kappa_2$, is the usual unbiased estimate of the sample variance. Note that we do not constrain the numerator to be non-negative in order to highlight the pathological issues that arise in the MLE.

If we know the sign of $\Delta$, then we take the square root to obtain a moment estimate for $\Delta$. However, if the sign of $\Delta$ is unknown, we must estimate it. We therefore need one more moment equation. A natural choice is the third cumulant of the observed mixture:

$$\kappa_3 = \pi(1-\pi)(1-2\pi)\Delta^3. \tag{4.3}$$

Since $\pi$ is assumed to be known, $\widehat{\mathrm{sgn}(\Delta)} = \mathrm{sgn}(\hat\kappa_3)$. We do not consider the knife-edge case of $\pi = 1/2$, in which there is no information in the data about the ordering of the components. This yields the following moment estimator for $\Delta$, with $\pi < 1/2$:

$$\hat\Delta_{mom} = \mathrm{sgn}(\hat\kappa_3)\sqrt{\frac{\hat\kappa_2 - \sigma^2}{\pi(1-\pi)}}. \tag{4.4}$$

Note that the third moment contains some information about the magnitude of $\Delta$, in addition to information about its sign.
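
The moment estimator in Equation 4.4 can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names `sample_cumulants` and `delta_mom` are hypothetical, the third cumulant is estimated with the standard third k-statistic (an assumption, since the text only specifies the estimate of $\kappa_2$), and an undefined estimate is returned as NaN rather than being set to zero.

```python
import math


def sample_cumulants(y):
    """Return (k2, k3): unbiased estimates of the 2nd and 3rd cumulants."""
    n = len(y)
    mean = sum(y) / n
    s2 = sum((v - mean) ** 2 for v in y)
    s3 = sum((v - mean) ** 3 for v in y)
    k2 = s2 / (n - 1)                     # usual unbiased sample variance
    k3 = n * s3 / ((n - 1) * (n - 2))     # third k-statistic (assumed choice)
    return k2, k3


def delta_mom(y, pi, sigma2):
    """Moment estimate of Delta as in Equation 4.4, for known pi and sigma2.

    Returns NaN when the estimated variance falls below sigma2, the
    situation the text later labels "pile up".
    """
    if not 0.0 < pi < 0.5:
        raise ValueError("this sketch assumes 0 < pi < 1/2")
    k2, k3 = sample_cumulants(y)
    delta_sq = (k2 - sigma2) / (pi * (1.0 - pi))
    if delta_sq < 0.0:
        return float("nan")               # numerator left unconstrained
    sign = 1.0 if k3 >= 0.0 else -1.0     # estimated sign of Delta
    return sign * math.sqrt(delta_sq)


# Toy check: 3 points at mu0 = 2 and 7 points at mu1 = -1, so pi = 0.3,
# with no within-component noise (sigma2 = 0).
y = [2.0] * 3 + [-1.0] * 7
print(delta_mom(y, pi=0.3, sigma2=0.0))   # about 3.16; the true Delta is 3
```

The toy estimate exceeds the true separation slightly because the unbiased variance rescales by n/(n-1); with realistic, noisy data the sampling variability of the second and third cumulants dominates.
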
Therefore, in this simple case, $\Delta$ is over-identified and we can use the Generalized Method of Moments (GMM; Hansen, 1982) or the Moment Generating Function method of Quandt and Ramsey (1978) to combine the moment equations. However, we use this simple estimate to approximate the MLE, and so we do not need to focus on these more complex approaches.

Accounting for uncertainty in $\hat\Delta_{mom}$ is conceptually straightforward. In the case of the two-component
Gaussian mixture model, Tan and Chang (1972) derive the sampling distribution of the first three cumulants up to $O(1/n^2)$:

$$\begin{pmatrix}\hat\kappa_1\\ \hat\kappa_2\\ \hat\kappa_3\end{pmatrix} \sim N\left(\begin{pmatrix}\kappa_1\\ \kappa_2\\ \kappa_3\end{pmatrix},\ \frac{1}{n}\begin{pmatrix}\kappa_2 & \kappa_3 & c^\pi_{13}\Delta^4\\ \kappa_3 & c^\pi_{22}\Delta^4 + 2\kappa_2^2 & c^\pi_{32}\Delta^5 + 6\kappa_2\kappa_3\\ c^\pi_{13}\Delta^4 & c^\pi_{32}\Delta^5 + 6\kappa_2\kappa_3 & c^\pi_{33a}\Delta^6 + c^\pi_{33b}\kappa_2\Delta^4 + 6\kappa_2^3\end{pmatrix}\right), \tag{4.5}$$

where $c^\pi_{rc}$ are constants that depend on $\pi$.[11] We can then apply the Delta method to Equation 4.4 to obtain approximate confidence intervals for $\hat\Delta_{mom}$.

[11] $c^\pi_{13} = \pi(1-\pi)$, $c^\pi_{22} = \pi(1-\pi)(1-6\pi(1-\pi))$, $c^\pi_{32} = \pi(1-\pi)(1-2\pi)(1-12\pi(1-\pi))$, $c^\pi_{33a} = \pi(1-\pi)(1-30\pi(1-\pi)+120\pi^2(1-\pi)^2) + 9\pi^2(1-\pi)^2(1-2\pi)^2$, $c^\pi_{33b} = 9\pi(1-\pi)(1-6\pi(1-\pi))$.

In practice, however, $\hat\Delta_{mom}$ is not a smooth function of $\hat\kappa_2$ and $\hat\kappa_3$. To make matters worse, estimates of higher order moments are extremely noisy: the variance of the sample variance depends on the fourth moment, and the variance of the sample skewness depends on the sixth moment. As a result, even if the true value of $\Delta$ is far from zero, due to sampling variability, the observed sample moments could yield estimates at or near critical points in the mapping from the sample moments to $\hat\Delta_{mom}$. We can now immediately see our two pathologies:

Pile up. If the estimated overall variance is less than the assumed within-group variance, $\hat\kappa_2 < \sigma^2$, then $\widehat{\Delta^2} < 0$ and $\hat\Delta_{mom}$ is undefined. Heuristically, the MLE in these settings is $\hat\Delta_{mle} = 0$ and generally corresponds to a unimodal likelihood. In other words, the MLE implicitly restricts the estimate of $\Delta^2$ to be nonnegative.

Sign error. If the estimated sign of the skewness does not equal the true sign of the skewness, $\mathrm{sgn}(\hat\kappa_3) \neq \mathrm{sgn}(\kappa_3)$, then $\mathrm{sgn}(\hat\Delta_{mom}) \neq \mathrm{sgn}(\Delta)$. Heuristically, this corresponds to the case when the higher mode in a bimodal likelihood does not, in fact, correspond to the global mode.

4.2 Assessing pathologies in practice

Using the sampling distributions from Equation 4.5, we can calculate the approximate probability of each pathology occurring for any given $\Delta$, $\pi$, $n$, and $\sigma^2$. For the case where $\Delta > 0$ and $\pi < 1/2$, the marginal probabilities of the two pathologies are:

$$P(\text{pile up}) = P(\hat\kappa_2 < \sigma^2) \approx \Phi\left(\frac{-\sqrt{n}\,\pi(1-\pi)\Delta^2}{\sqrt{c^\pi_{22}\Delta^4 + 2\kappa_2^2}}\right), \tag{4.6}$$

$$P(\text{sign error}) = P(\mathrm{sgn}(\hat\kappa_3) \neq \mathrm{sgn}(\kappa_3)) \approx \Phi\left(\frac{-\sqrt{n}\,\pi(1-\pi)(1-2\pi)\Delta^3}{\sqrt{c^\pi_{33a}\Delta^6 + c^\pi_{33b}\kappa_2\Delta^4 + 6\kappa_2^3}}\right). \tag{4.7}$$

We use simulation to assess the accuracy of these approximations; see the supplementary materials for additional information. For pile up, the moment approximations are uniformly excellent, with over 95% agreement across simulation values. For sign error, the approximations are quite good for moderate $\pi$, but are less informative with smaller $\Delta$ and more extreme $\pi$.

Figure F.7a shows the joint approximate probabilities of pile up and sign error, calculated via the joint distribution from Equation 4.5, with $\pi = 0.325$, $\Delta = 0.25$, and varying sample size. Unsurprisingly, as the
sample size and $\Delta$ increase, the chance of a pathological result indeed decreases. However, these pathologies are hardly small sample issues: for $\Delta = 0.25$, which would be quite large in many social science applications, pile up remains the likeliest outcome even with sample sizes in the thousands. For $\Delta = 0.75$ (not shown) there is only a 70% chance of a correct estimate for […]. Figure F.7b shows results for a moderate sample size of 200. In this case, if the mixture means are 0.5 standard deviations apart, a given estimate is just as likely to be a pile up, have the wrong sign, or be correct.

These formulas can be used in a manner analogous to design or power calculations, as researchers typically know $N$ and $\pi$ prior to estimating a mixture model. Graphs such as Figure F.7b can immediately show whether pile up and sign error are meaningful issues for plausible values of $\Delta$.

[Figure 4: Normal approximations for the probability of each pathology (Correct, Pile Up, Wrong Sign) for given sample size and separation of means. Panel (a): $\Delta = 0.25$, by sample size. Panel (b): $N = 200$, by assumed $\Delta$. $\pi$ is assumed to be 0.325. The correct outcome is defined as one minus the probability of pile up or sign error.]

Finally, in smaller samples, we might be concerned about the Normal approximation to the sampling distributions for $\hat\kappa_2$ and $\hat\kappa_3$. In this case, it is convenient to use a case-resampling bootstrap to approximate these distributions. While this does not yield analytical formulas for the probability of each pathology, it is straightforward to estimate both $P\{\hat\kappa_2 < \sigma^2\}$ and $P\{\hat\kappa_3 < 0\}$, without needing to rely on an asymptotic approximation. Note that this is a standard application of the case-resampling bootstrap, for which all the usual guarantees hold, rather than a use of the bootstrap for uncertainty on mixture component means.

4.3 Bias due to pile up

If $\widehat{\Delta^2} < 0$, it is clear that $\hat\Delta_{mom}$ is undefined. Less obvious but no less important is that $\hat\Delta_{mom}$ is severely biased if $\widehat{\Delta^2} > 0$ but the probability of pile up is large.
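
To gauge when the probability of pile up is large, the marginal Normal approximations in Equations 4.6 and 4.7 can be evaluated directly. The sketch below is illustrative only: the function names `norm_cdf` and `pathology_probs` are hypothetical, it assumes $\Delta > 0$ and $0 < \pi < 1/2$ as in the text, and it returns the two marginal probabilities rather than the joint probabilities plotted in the figures.

```python
import math


def norm_cdf(x):
    """Standard Normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pathology_probs(delta, pi, n, sigma2=1.0):
    """Approximate (P(pile up), P(sign error)) from Equations 4.6 and 4.7.

    Assumes delta > 0 and 0 < pi < 1/2; both values are marginal
    probabilities under the Normal approximation of Equation 4.5.
    """
    q = pi * (1.0 - pi)
    k2 = sigma2 + q * delta ** 2                      # Equation 4.2
    # Constants c_22, c_33a, c_33b that depend only on pi.
    c22 = q * (1.0 - 6.0 * q)
    c33a = (q * (1.0 - 30.0 * q + 120.0 * q ** 2)
            + 9.0 * q ** 2 * (1.0 - 2.0 * pi) ** 2)
    c33b = 9.0 * q * (1.0 - 6.0 * q)
    # n times the approximate variances of the 2nd and 3rd sample cumulants.
    var2 = c22 * delta ** 4 + 2.0 * k2 ** 2
    var3 = c33a * delta ** 6 + c33b * k2 * delta ** 4 + 6.0 * k2 ** 3
    root_n = math.sqrt(n)
    p_pile = norm_cdf(-root_n * q * delta ** 2 / math.sqrt(var2))
    p_sign = norm_cdf(-root_n * q * (1.0 - 2.0 * pi) * delta ** 3
                      / math.sqrt(var3))
    return p_pile, p_sign


# Design-style check for a few sample sizes at Delta = 0.25, pi = 0.325.
for n in (200, 1000, 5000):
    print(n, pathology_probs(0.25, 0.325, n))
```

Used this way, the function plays the role of a power calculation: a researcher can scan plausible values of $\Delta$ at the planned sample size before fitting a mixture model.
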
Intuitively, this occurs because we are implicitly conditioning on the fact that $\widehat{\Delta^2} > 0$ when $\hat\Delta_{mom}$ is well-defined; we are therefore estimating the mean of a
truncated Normal. We see the same behavior with the MLE; the mode of the bimodal likelihood is biased away from zero because we are implicitly conditioning on the fact that the observed likelihood surface is indeed bimodal rather than unimodal.

[Figure 5: Mean for $\hat\Delta_{mle}$ (simulated) and $\hat\Delta_{mom}$ (calculated) when $\hat\kappa_2 > \sigma^2$. The true value is $\Delta = 0.25$.]

Figure 5 shows the bias of $\hat\Delta_{mle}$ in simulation studies for which there are no pathologies and for which the true $\Delta = 0.25$ and $\pi = $ […]. The bias is substantial throughout, and for sample sizes below 1000, the bias is larger in magnitude than $\Delta$. As with the overall MLE, we can characterize this behavior using the MOM estimate. In particular, we use the law of the unconscious statistician (LOTUS) and numerical integration to obtain the expected value of the truncated distribution, $E[\hat\Delta_{mom} \mid \hat\kappa_2 > \sigma^2]$. As shown in Figure 5, this is a good approximation to the bias calculated via simulation.

It is useful to note that the bias of the MLE in this setting is closely related to the bias induced by introducing identifiability constraints, such as $\Delta > 0$ (Jasra et al., 2005; Frühwirth-Schnatter, 2006). In both cases, the MLE is the maximum of a truncated likelihood surface, truncated at the line $\Delta = 0$.

5 Confidence sets via inverting tests

Given the poor performance of the MLE, we are interested in methods that perform well even when $\Delta$ is small. Based on the large literature on weak identification in other settings, we presume that many such methods are possible. As a starting point, we suggest an approach to construct confidence intervals
based on inverting a sequence of tests. This approach is widely used in other weak identification settings, namely weak instruments (e.g., Staiger and Stock, 1997; Kang et al., 2015) and the unit root moving average problem (Mikusheva, 2007). It is also closely related to the method of constructing confidence intervals for causal effects by inverting a sequence of Fisher Randomization Tests (Imbens and Rubin, 2015).

At the same time, this approach has its drawbacks. First, while test inversion yields confidence sets with good coverage properties, it does not necessarily yield good point estimates. In particular, it is possible to construct a Hodges-Lehmann-style estimator via the point on the grid with the highest p-value (Hodges and Lehmann, 1963). But since pile up and sign error remain issues, any point estimator in this case should be interpreted with caution. Second, the coverage guarantees hold only when the model is correctly specified; under even moderate mis-specification, the resulting estimator can cease to exist (Gelman, 2011). In the supplementary material, we explore the effects of mis-specification on the resulting confidence sets. Unsurprisingly, we find that the approach works well under mild mis-specification (i.e., generating data via $t_{50}$ rather than Normal) but otherwise performs poorly. Note that the MLE performs poorly even when the model is correctly specified. Alternatively, researchers uninterested in test inversion for confidence intervals might nonetheless be interested in using this approach to assess model fit. If the proposed procedure rejects everywhere, this is evidence that the Normal mixture model is a poor fit.

We discuss two basic approaches here. Our first approach is a version of the grid bootstrap of Andrews (1993) and Hansen (1999), which generates Monte Carlo p-values by simulating fake data sets from the null hypothesis. While the grid bootstrap is conceptually straightforward and enjoys theoretical guarantees (Mikusheva, 2007), it is also computationally intensive.
Our second approach is therefore a fast approximation that directly uses the Normal sampling distribution in Equation 4.5 to derive a $\chi^2$ test at each grid point.

To demonstrate these methods, we first outline inference for $\Delta$ alone and then extend this to inference for the component-specific means, $\mu_0$ and $\mu_1$. Since the additional details are not central to our argument, we discuss inference for the broader principal stratification model in the supplementary materials.

5.1 Overview of grid bootstrap

To conduct a grid bootstrap, we first need a grid. Define $\{\Delta_0, \Delta_1, \ldots, \Delta_n\}$ with $\Delta_i > \Delta_j$ for $i > j$. The immediate goal is then to obtain a p-value for the following null hypotheses for each value $\Delta_j$ in the grid:

$$H_0: \Delta = \Delta_j \quad \text{vs.} \quad H_1: \Delta \neq \Delta_j. \tag{5.1}$$

For convenience we first center the data (we return to this in the next section). Next, we need a test statistic, $t(y, \Delta_j)$, that is a function of the observed (or simulated) data and the value of $\Delta$ under the null hypothesis, $\Delta_j$. For a given grid, and initially assuming $\pi$ and $\sigma^2$ are known, we then obtain exact p-values through simulation with the following procedure:

For each $\Delta_j$ in the grid:
  - Calculate the observed test statistic, $t^{obs}_j = t(y^{obs}, \Delta_j)$.
  - Generate $B$ data sets of size $N$ from the model
    $$y^*_j \overset{iid}{\sim} \pi\,N\left(\frac{\Delta_j}{2}, \sigma^2\right) + (1-\pi)\,N\left(-\frac{\Delta_j}{2}, \sigma^2\right).$$
  - For each simulated $y^*_j$, compute $t^*_j = t(y^*_j, \Delta_j)$.
  - Calculate the empirical p-value of $t^{obs}_j$ as a function of the null distribution, $t^*_j$.

Calculate the confidence set, $CS_\alpha(\Delta) = \{\Delta_j : p(\Delta_j) > 1 - \alpha\}$ for a specified significance level $\alpha$, where $p(\Delta_j)$ is the empirical p-value of $\hat\Delta_{mle}$ assuming that $\Delta = \Delta_j$.

Note that the resulting confidence set might not be continuous, which could occur if the sampling distribution is strongly bimodal.

5.2 Constructing a test statistic

So long as the model is correctly specified, this approach yields an exact p-value for any valid test statistic, up to Monte Carlo error (Mikusheva, 2007). We propose a test statistic based on the joint distribution of $\hat\kappa_2$ and $\hat\kappa_3$.[12] Equation 4.5 suggests a natural combination of the estimated cumulants:

$$t_\kappa(y, \Delta_j) = (d_2, d_3)\,\mathrm{Var}(\hat\kappa_2, \hat\kappa_3)^{-1}\,(d_2, d_3)^T, \tag{5.2}$$

where $d_k = \hat\kappa_k - \kappa_k$, and we use the assumed null of $\Delta = \Delta_j$ to obtain $(\kappa_2, \kappa_3)$ and $\mathrm{Var}(\hat\kappa_2, \hat\kappa_3)$. In practice, the Normal approximation in Equation 4.5 is excellent, even for modest sample sizes (say $N > 100$). This implies:

$$t_\kappa(y, \Delta_j) \overset{a}{\sim} \chi^2_2.$$

We can therefore obtain a p-value via a Wald test at each grid point, which is fast computationally.

Finally, to use these approaches to estimate component means, we need to (1) expand the grid, and (2) expand the test statistic. A natural choice for a grid of points is the two-dimensional grid over $\mu_0$ and $\mu_1$. To expand the test statistic, we directly use the first three cumulants from Equation 4.5 to obtain a joint test statistic as in Equation 5.2:

$$t_\kappa(y, \Delta_j) = (d_1, d_2, d_3)\,\mathrm{Var}(\hat\kappa_1, \hat\kappa_2, \hat\kappa_3)^{-1}\,(d_1, d_2, d_3)^T \overset{a}{\sim} \chi^2_3. \tag{5.3}$$

[13] As above, we can obtain p-values via the grid bootstrap rather than via the $\chi^2$ distribution.

Figure 6 shows the distribution of p-values for three different examples from the same data generating process, with $N = 1000$, $\pi = 0.325$, $\sigma^2 = 1$, $\mu_0 = +1/8$, $\mu_1 = $ […]. Figure A.1 shows the 95% coverage for the confidence sets obtained through this fast approximation. As expected, the coverage is essentially exact.
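
The fast approximation for $\Delta$ alone (the Wald test of Equation 5.2 evaluated at each grid point) can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `null_moments`, `wald_pvalue`, and `confidence_set` are hypothetical, the inputs are the already-computed cumulant estimates rather than raw data, and the set keeps grid points whose p-value exceeds `1 - level`, which assumes that $\alpha$ in the text denotes the nominal coverage (for example 0.95). It does not implement the Monte Carlo grid bootstrap.

```python
import math


def null_moments(delta, pi, sigma2):
    """Return kappa2, kappa3 and the entries (v22, v23, v33) of
    n * Var(kappa2_hat, kappa3_hat) under the null Delta = delta."""
    q = pi * (1.0 - pi)
    k2 = sigma2 + q * delta ** 2                       # Equation 4.2
    k3 = q * (1.0 - 2.0 * pi) * delta ** 3             # Equation 4.3
    c22 = q * (1.0 - 6.0 * q)
    c32 = q * (1.0 - 2.0 * pi) * (1.0 - 12.0 * q)
    c33a = (q * (1.0 - 30.0 * q + 120.0 * q ** 2)
            + 9.0 * q ** 2 * (1.0 - 2.0 * pi) ** 2)
    c33b = 9.0 * q * (1.0 - 6.0 * q)
    v22 = c22 * delta ** 4 + 2.0 * k2 ** 2
    v23 = c32 * delta ** 5 + 6.0 * k2 * k3
    v33 = c33a * delta ** 6 + c33b * k2 * delta ** 4 + 6.0 * k2 ** 3
    return k2, k3, (v22, v23, v33)


def wald_pvalue(k2_hat, k3_hat, n, delta, pi, sigma2=1.0):
    """Chi-square (2 df) p-value for H0: Delta = delta, as in Equation 5.2."""
    k2, k3, (v22, v23, v33) = null_moments(delta, pi, sigma2)
    d2, d3 = k2_hat - k2, k3_hat - k3
    det = v22 * v33 - v23 ** 2
    # Quadratic form d' (V / n)^{-1} d with the 2x2 inverse written out.
    t = n * (v33 * d2 ** 2 - 2.0 * v23 * d2 * d3 + v22 * d3 ** 2) / det
    return math.exp(-0.5 * t)      # survival function of chi-square, 2 df


def confidence_set(k2_hat, k3_hat, n, grid, pi, sigma2=1.0, level=0.95):
    """Grid points that the Wald test does not reject at the given level."""
    return [d for d in grid
            if wald_pvalue(k2_hat, k3_hat, n, d, pi, sigma2) > 1.0 - level]


# Example: cumulant estimates that match Delta = 0.5 exactly, with N = 1000.
k2_hat, k3_hat, _ = null_moments(0.5, 0.325, 1.0)
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(confidence_set(k2_hat, k3_hat, 1000, grid, pi=0.325))
```

Because the output is a set of grid points rather than an interval, it can be disconnected or can contain both positive and negative values of $\Delta$ when the third cumulant carries little information about the sign.
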
In particular, 95% coverage for this procedure is far better than the corresponding coverage based on the MLE.

[12] There are many possible alternatives. For example, Frumento et al. (2016) suggest test statistics based on scaled log-likelihood ratios. Another option is to use univariate test statistics based on $\hat\kappa_2$ or $\hat\kappa_3$.

[13] Note that the $\chi^2$ distribution no longer holds when $\mu_0 = \mu_1$. While we can use a univariate Normal distribution to obtain a valid p-value in this case, this additional complication is generally unnecessary in practice.

6 Application to JOBS II

We now return to our example of JOBS II. As described in Section 2, we focus on the subset of […] randomly assigned to treatment and […] randomly assigned to control. In the treatment group,
$1 - \pi = 45$ percent of individuals do not enroll in the program. The primary outcome is log depression score six months after randomization. For convenience, we standardize the outcome by subtracting off the grand mean and dividing by $\hat\sigma_1 = \sqrt{\pi\hat\sigma_{n1}^2 + (1-\pi)\hat\sigma_{c1}^2}$, the estimated within-component standard deviation under treatment. Table 1 shows summary statistics for the three observed groups for the standardized outcome. Based on the group means, it is clear that workers who are observed to enroll in the program have lower depression, on average, than those who do not. Note that the point estimates for $\sigma_{c1}$ and $\sigma_{n1}$ are quite close, which is consistent with our equal variance assumption (we relax this assumption below).

[Figure 6: Three examples of the distribution of Wald test p-values from Equation 5.3, plotted over $\mu_0$ and $\mu_1$. Simulated data are from Equation 1.1 with $N = 1000$, $\pi = 0.325$, $\sigma^2 = 1$, $\mu_0 = +1/8$, $\mu_1 = $ […]. The dark line shows the cutoff for […]. The red dot shows the true value. Note that the Wald test is undefined when $\mu_0 = \mu_1$.]

[Table 1: Summary statistics for observed groups in JOBS II. Columns: $Z$, $D^{obs}$, Observed Mean, Observed SD, Possible Principal Strata. The three observed groups correspond to the possible principal strata Compliers; Never Takers; and Compliers and Never Takers.]

To assess the performance of the MLE in this setting, we simulate data with the same parameters as in the JOBS II example, $N = 132$, $\pi = 0.55$, and $\sigma^2 = 1$. Figures 7a and 7b show the bias and 95% coverage for the MLE for these parameters and assumed values of $\Delta$. Figure 7c shows the probability of each pathology for these parameters. Note that the results are unchanged if we consider negative values of $\Delta$.

The pattern is striking. For values of $\Delta < 0.5$, the most likely $\hat\Delta_{mle}$ is 0. That is, the likeliest outcome is a unimodal likelihood with the mode centered at 0. Unsurprisingly, the MLE has poor bias and coverage properties for reasonable values of $\Delta$. In particular, we see that the coverage of $\hat\Delta_{mle}$ decreases as the assumed value of $\Delta$ increases from 0.1 to 2.0.
This is due to the increasing probability of a sign error occurring and the poor coverage of $\hat\Delta_{mle}$ conditional on making a sign error. These suggest that, at best, we should interpret the MLE in this example with caution.

Figure 8 shows the outcome distribution and bootstrap sampling distributions for $\hat\kappa_2$ and $\hat\kappa_3$. The observed standard deviation for the control group is $\sqrt{\hat\kappa_2} = $ […], and, based on the bootstrap, $P(\hat\kappa_2 < 1) = $ […]. This suggests that there is little information in the second moment about the magnitude of $\Delta$. The observed third sample moment for the control group is $\hat\kappa_3 = $ […], with a bootstrap probability $P(\hat\kappa_3 < 0) = $ […]. This suggests that the outcome distribution is sufficiently skewed such that the discontinuity, and hence the
