Estimation and Inference for Set-identi ed Parameters Using Posterior Lower Probability

Size: px

Start display at page:

Download "Estimation and Inference for Set-identi ed Parameters Using Posterior Lower Probability"

Randall Peters
5 years ago
Views:

1 Estimation and Inference for Set-identi ed Parameters Using Posterior Lower Probability Toru Kitagawa CeMMAP and Department of Economics, UCL First Draft: September 2010 This Draft: March, 2012 Abstract In inference about set-identi ed parameters, it is known that the Bayesian probability statements about the unknown parameters do not coincide, even asymptotically, with the frequentist s con dence statements. This paper aims to smooth out this disagreement from a multiple prior robust Bayes perspective. We show that there exist a class of prior distributions (ambiguous belief), with which the posterior probability statements drawn via the lower envelope (lower probability) of the posterior class asymptotically agree with the frequentist con dence statements for the identi ed set. With such class of priors, we analyze statistical decision problems including point estimation of the set-identi ed parameter by applying the gammaminimax criterion. From a subjective robust Bayes point of view, the class of priors can serve as benchmark ambiguous belief, by introspectively assessing which one can judge whether the frequentist inferential output for the set-identi ed model is an appealing approximation of his posterior ambiguous belief. Keywords: Partial Identi cation, Bayesian Robustness, Belief Function, Imprecise Probability, Gamma-minimax, Random Set. JEL Classi cation: C12, C15, C21. t.kitagawa@ucl.ac.uk. I thank Andrew Chesher, Siddhartha Chib, Jean-Pierre Florens, Charles Manski, Ulrich Müller, Andriy Norets, Adam Rosen, Kevin Song, and Elie Tamer for valuable discussions and comments. I also thank the seminar participants at Academia Sinica Taiwan, Cowles Summer Conference 2011, EC 2 Conference 2010, MEG 2010, RES Annual Conference 2011, Simon Fraser University, and the University of British Columbia for helpful comments. All remaining errors are mine. Financial support from the ESRC through the ESRC Centre for Microdata Methods and Practice (CEMMAP) (grant number RES ) is gratefully acknowledged. 1

2 1 Introduction In the usual situations of inferring identi ed parameters, the Bayesian probability statements about the unknown parameters are similar, at least asymptotically, to the frequentist con dence statements for the true value of the parameters. In partial identi cation analysis initiated by Manski (1989, 1990, 2003, 2008), it is known that such asymptotic harmony between the two inference paradigms breaks down (Moon and Schorfheide (2011)). The Bayesian interval estimates for the set-identi ed parameter are shorter, even asymptotically, than the frequentist ones, and they asymptotically lie strictly inside of the frequentist s con dence intervals. Frequentists might interpret this phenomenon as that the Bayesian s over-con dence in their inferential statement is ctitious. Bayesians, on the other hand, might reply that the frequentist s con dence statements su er from an extreme conservatism that lacks any posterior probability justi cation. The main aim of this paper is to smooth out such disagreement between the two schools of statistical inference from a perspective of robust Bayes inference: a third paradigm of statistical inference, in which we can incorporate researcher s partial prior knowledge into posterior inference. Among various robust Bayes approaches, this paper in particular focuses on a multiple prior Bayes analysis, where the partial prior knowledge or the robustness concern against prior misspeci cation is modeled by a class of priors (ambiguous belief). The Bayes rule is applied to each prior to form the class of posteriors. Posterior inference procedures considered in this paper operate on the thus-constructed class of posteriors by focusing on its lower and upper envelopes, the so-called posterior lower and upper probabilities. Along this robust Bayes approach, this paper considers the following questions. First, are there any classes of priors, with which the probabilistic statements made via the posterior lower probability can approximate the frequentist s con dence statements for set-identi ed parameters? The answer turns out to be yes. When the parameters are not identi ed, a prior distribution of the model parameters is decomposed into two components: one that can be updated by data (revisable prior knowledge) and one that can never be updated by data (unrevisable prior knowledge). We show that, if a prior class is designed in such way that it shares a single prior distribution for revisable prior knowledge, but allows for arbitrary prior distributions for the unrevisable prior knowledge, then, under certain regularity conditions, the posterior interval estimates constructed based upon the posterior lower probability asymptotically attains the correct frequentist coverage probability for the identi ed set. 2

3 The second question we examine is, with such class of priors, is it possible to formulate and solve statistical decision problems including point estimation of the set-identi ed parameter? We approach this question by adapting the gamma-minimax decision analysis, which can be seen as a minimax analysis in the multiple prior setup. We demonstrate that the proposed prior class leads us to an analytically tractable and numerically solvable formulation of the gamma-minimax decision problem, provided that the identi ed set for the parameter of interest can be computed for each value of the identi ed parameters We believe our analyses are insightful not only from the "objective" Bayesian point of view but also from the subjective robust Bayesian point of view, because the classes of priors that can asymptotically match with the frequentist s inference can serve as benchmark ambiguous belief. That is, by subjectively assessing if the ambiguous belief represented by the benchmark prior class appears reasonable or not, one can judge if the frequentist s inferential output is too conservative or not in view of his posterior ambiguous belief. On this issue, we argue through an example that some priors in the benchmark prior class appear less appealing than some others, so that subjective robust Bayesians would consider that the frequentist inference statements for identi ed set are too uninformative to interpret it as an approximation of their posterior ambiguous belief. 1.1 Literature Review Estimation and inference in partial identi ed models have been a growing area of research in econometrics. From the frequentist perspective, Horowitz and Manski (2000) construct con dence sets for the interval identi ed set. Imbens and Manski (2004) propose uniformly asymptotically valid con dence sets for an intervally-identi ed parameter. Stoye (2009) extends their analysis to a di erent set of conditions. Chernozhukov, Hong, and Tamer (2007) develop a way to construct asymptotically valid con dence sets for an identi ed set based on the criterion function approach, which can be applied to the wide range of partially identi ed models including moment inequality models. Related to the criterion function approach, the literatures on construction of con dence sets by inverting test statistics include, Andrews and Guggenberger (2009), Andrews and Soares (2010), and Romano and Shaikh (2010), to list a few. From the Bayesian perspective, Neath and Samaniego (1997), Poirier (1998), and Gustafson (2009, 2010) analyze how the Bayesian updating operates when a model lacks identi ability. Liao and Jiang (2010) conduct Bayesian inference for moment inequality models based on the pseudo- 3

4 likelihood. Moon and Schorfheide (2011) compares asymptotic properties of frequentist and Bayesian inference for set-identi ed models, and our robust Bayes analysis is motivated by their important nding on the asymptotic disagreement between them. The lower and upper probabilities originate from Dempster (1966, 1967a, 1967b, 1968) in his ducial argument of drawing posterior inferences without specifying a prior distribution. Demspter s analyses have given a large impact to the eld of statistics, arti cial intelligence, and decision theory, such as the belief function analysis of Shafer (1976, 1982), the imprecise probability analysis of Walley (1991). The lower and upper probabilities have been playing the major role also in the robust Bayes analysis for measuring the global sensitivity of the posterior (Berger (1984), Berger and Berliner (1986)) and for characterizing a class of posteriors (DeRobertis and Hartigan (1981), Wasserman (1989, 1990), and Wasserman and Kadane (1990)). In econometrics, the pioneering work using multiple priors was carried out by Chamberlain and Leamer (1976), and Leamer (1982), who obtained the bounds for the posterior mean of the regression coe cients when a prior varies over a certain class. These previous studies do not explicitly considers non-identi ed models, while this paper focuses on non-identi ed models, and aims to clarify a link between an early idea of the lower and upper probabilities and a recent issue on inference in set-identi ed models. The posterior lower probability that we shall obtain with our speci cation of prior class turns out to be an in nite-order capacity, or equivalently, a containment functional in the random set theory (see, e.g., Molchanov (2005)). Beresteanu and Molinari (2008) and Bereseanu, Molchanov, and Molinari (2012) show the usefulness and wide applicability of random set theory to partially identi ed models by viewing an observation as a random set and the estimand as its Aumann expectation. Galichon and Henry (2006, 2009) proposed a use of in nite-order capacity in de ning and inferring the identi ed set in the framework of incomplete econometric models. Our analysis albeit the di erence in the source of probability is seen as a robust Bayes analogue to these frequentist inference procedures based of the random set theory and the capacity. Our decision theoretic analysis employs the gamma-minimax criterion (see Berger (1985)); this leads us to a decision that minimizes the risk formed under the most pessimistic prior within the class. The gamma-minimax decision problem often becomes challenging both analytically and numerically, and its analysis is limited to rather simple parametric models with a certain choice of prior class (see Betro and Ruggeri (1992), and Vidakovic (2000), for example). Our speci cation of prior class, in contrast, o ers a simple solution to the gamma-minimax decision problem, provided 4

5 that the identi ed set for the parameter of interest can be computed for each value of the identi ed parameters. 1.2 Plan of the Paper The rest of the paper is organized as follows. In Section 2, we illustrate our main results using a simple example of missing data. In Section 3, a general framework is introduced. We construct a class of prior distributions that can contain arbitrary unrevisable prior knowledge, and derive the posterior lower and upper probabilities. Statistical decision analyses with multiple priors are examined in Section 4. In Section 5, we investigate how to construct the posterior credible region based on the posterior lower probability, and its large-sample behavior is examined in an interval-identi ed parameter case. Proofs and lemmas are provided in Appendix A. 2 An Illustrating Example: Missing Data In this section we illustrate the main results of the paper using an example of missing data (Manski (1989)). Let Y 2 f1; 0g be a binary outcome of interest (e.g., a worker is employed or not) and let D 2 f1; 0g be an indicator of whether Y is being observed (D = 1) or not (D = 0), i.e., the subject responded or not. Data are given by a random sample of size n, x = f(y i D i ; D i ) : i = 1; : : : ; ng. The starting point of our analysis is to specify a parameter vector 2 that pins down the distribution of data as well as the parameter of interest. four probability masses, ( yd ), yd Pr (Y = y; D = d), y = 1; 0, d = 1; 0. likelihood for is written as p(xj) = n n [ ] n mis ; where n 11 Here, can be speci ed by a vector of The observed data = P n i=1 Y id i ; n 01 = P n i=1 (1 Y i)d i ; n mis = P n i=1 (1 D i). Clearly, this likelihood function depends on only through the three probability masses, = ( 11 ; 01 ; mis ) ( 11 ; 01 ; ); 2, no matter what the observations are, so that the likelihood for has "data-independent" at regions expressed by a set-valued map of, () f 2 : 11 = 11 ; 01 = 01 ; = mis g : The parameter of interest is the mean of Y; which is written as a function of, Pr(Y = 1) = The identi ed set of, H (), as the set-valued map of is de ned by the range of 5

6 when varies over (), H() = [ 11 ; 11 + mis ]; which are the Manski (1989) s bounds of Pr(Y = 1). In a standard Bayes inference for, we specify a prior of, update it by the Bayes rule, and marginalize the posterior of to. If the likelihood for has data-independent at regions, then the prior for conditional on f 2 ()g (i.e., belief on how the proportion of missing observations mis is divided into 10 and 00 ) will never be updated by data, so that the posterior of as well as become sensitive to how to specify it. The robust Bayes procedure considered in this paper aims to make posterior inference free from such sensitivity concern by introducing multiple priors for. The rst step for constructing our prior class is to specify a single prior for identi ed parameters. In view of, prior speci es how much prior belief should be assigned to each at region of the s likelihood (), whereas, depending on ways to allocate the assigned belief over 2 () (for each ), the implied prior for may di er. Therefore, by collecting all the possible ways to allocate the assigned belief over f 2 () ; 2 g, we can obtain the following class of prior distributions of, M = : ( (B)) = (B) for all B. By applying the Bayes rule to each 2 M and marginalizing the posterior of for, we obtain the class of posteriors of, F jx : 2 M. The class of posteriors is summarized by its lower envelope (lower probability), F jx () = inf 2M( ) F jx (), which is interpreted as that the posterior belief for f 2 g is at least F jx () no matter what 2 M is used, (i.e. no matter how prior beliefs are formed for 10 and 00 given the proportion of missing observations mis ). In the main theorem of this paper, we show F jx () = F jx (f : H () g), where F jx denotes the posterior distribution of implied from the prior. An implication of this result is that the posterior probability law of identi ed sets, H () : F jx, can be used to summarize the posterior ambiguous belief (multiple porsteriors) for implied by prior class M, and this fact plays a key role in our development of statistical decision and set inference for based on the posterior lower probability. With leaving their formal analysis in later sections, let us state brie y how to implement the posterior lower probability inference for. 1. Specify a prior for and update it by the Bayes rule. When a credible prior for is not 6

7 available, a reasonably "non-informative" prior may be used Let f s : s = 1; : : : ; Sg be random draws of from the posterior of. The mean and median of the posterior lower probability of can be de ned via the gamma-minimax decision criterion, and they can be approximated by 1 arg min a S respectively. SX sup s=1 2H( s ) (a ) 2 1 and arg min a S SX sup s=1 2H( s ) ja j ; 3. The posterior lower credible region of at credibility level 1, which can be interpreted as a contour set of the posterior lower probability of, is de ned by the smallest interval that contains H () with posterior probability 1 algorithm to compute it for interval-identi ed cases). (Proposition 5.1 below proposes an Under certain regularity conditions, which are satis ed in the current missing data example, the posterior lower credible region of constructed in this way asymptotically attains the frequentist coverage probability 1 for the true identi ed set. 3 Multiple-prior Analysis and the Lower and Upper Probabilities 3.1 Likelihood and Set Identi cation: The General Framework Let (X; X ) and (; A) be measurable spaces of a sample X 2 X and a parameter vector 2, respectively. Our analytical framework covers both a parametric model = R d, d < 1, and a non-parametric model where is a separable Banach space. We make the sample size implicit in our notation until the large sample analysis in Section 5. Let be a marginal probability distribution on the parameter space (; A), referred to as a prior distribution for. We assume that the conditional distribution of X, given, has the probability density p(xj) at every 2 with respect to a - nite measure on (X; X ). We call p(xj) the likelihood for. The parameter vector may consist of parameters that determine the behaviors of economic agents as well as those that characterize the distribution of unobserved heterogeneities in the population. In the context of the missing data or counterfactual causal models, indexes the distribution of the underlying population outcomes or the potential outcomes (see Example 3.1 below). In all of these cases, the parameter should be distinguished from the parameters that are 1 See Kass and Wasserman (1996) for a survey of reasonably non-informative priors. 7

8 used solely to index the sampling distribution of observations. A problem with the identi cation of typically arises in this context: if multiple values of can generate the same distribution of data, then we claim that these s are observationally equivalent and the identi cation of fails. In terms of the likelihood function, the observational equivalence of and 0 6= means that the values of the likelihood at and 0 are equal for every possible sample, i.e., p(xj) = p(xj 0 ) for every x 2 X (Rothenberg (1971), Drèze (1974), and Kadane (1974)). We represent the observational equivalence relation of s by a many-to-one function g : (; A)! (; B): g() = g( 0 ) if and only if p(xj) = p(xj 0 ) for all x 2 X: The equivalence relationship partitions the parameter space into equivalent classes, in each of which the likelihood of is at, irrespective of observations, and = g() maps each of these equivalent classes to a point in another parameter space. In the language of structural models in econometrics (Hurwicz (1950), and Koopman and Reiersol (1950)), = g() is interpreted as the reduced-form parameter that carries all the information for the structural parameters through the value of the likelihood function. In the literature of Bayesian statistics, = g() is referred to as the minimally su cient parameter (abbreviated to su cient parameter), and the range space of g(), (; B), is called the su cient parameter space (Barankin (1960), Dawid (1979), Florens and Mouchart (1977), Picci (1977), and Florens, Mouchart, and Rolin (1990)). 2 See Florens and Simoni (2011) for comprehensive discussions on the relationship between frequentist and Bayesian identi cation. In the presence of su cient parameters, the likelihood depends on only through the function g(), i.e., there exists a B-measurable function ^p(xj) such that p(xj) = ^p(xjg()) 8x 2 X and 2 (3.1) holds (Lemma of Lehmann and Romano (2005)). Denote the inverse image of g() by : () = f 2 : g() = g, where () and ( 0 ) for 6= 0 are disjoint, and f () ; 2 g constitutes a partition of. We assume that g() is onto, g() = ; so that () is assumed to be non-empty for every The su cient parameter space is unique up to one-to-one transformations (Picci (1977)). 3 In an observationally restrictive model in the sense of Koopman and Reiersol (1950), ^p(xj), i.e., the likelihood 8

9 In the set-identi ed model, the parameter of interest is typically a subvector or a transformation of. We denote parameters of interest by a measurable transformation of, = h(), h : (; A)! (H; D). We de ne the identi ed set of by the projection of () onto H through h(). De nition 3.1 (Identi ed Set of ) (i) The identi ed set of is a set-valued map H : H de ned by H() fh() : 2 ()g : (ii) The parameter = h() is point-identi ed at if H() is a singleton, and is set-identi ed at if H () is not a singleton. Note that identi cation of is de ned in the pre-posterior sense because it is based on the likelihood evaluated at every possible realization of a sample, not only on the observed one. The task of constructing the sharp bounds of in the partially identi ed model is equivalent to nding the expression of H(). 3.2 Examples We now give some examples, both to illustrate the above concepts and notations, and to provide a concrete focus for later development. Example 3.1 (Bounding ATE by Linear Programming) Consider the treatment e ect model with incompliance and a binary instrument 2 f1; 0g, as considered in Imbens and Angrist (1994), and Angrist, Imbens, and Rubin (1996). Assume that the treatment status and the outcome of interest are both binary. Let Y 2 Y be the observed outcome and (Y 1 ; Y 0 ) be the potential outcomes, as de ned in Example 2.1. We denote by (D 1 ; D 0 ) 2 f1; 0g 2 the potential selection responses to the instrument with the observed treatment status D = D 1 + (1 )D 0. The data consist of a random sample of (Y i ; D i ; i ). Following Imbens and Angrist (1994), we partition the population into four subpopulations de ned in terms of potential treatment-selection responses: 8 c if D 1i = 1 and D 0i = 0 : complier, >< at if D 1i = D 0i = 1 : always-taker, T i = nt if D 1i = D 0i = 0 : never-taker, >: d if D 1i = 0 and D 0i = 1 : de er, function for the su cient parameters, is well de ned for a domain larger than g(); for example, see Example 2.3 in Section 2.2. In this case, the model possesses the refutability property, and () can be empty for some 2. We do not consider such refutable models in this paper. 9

10 where T i is the indicator for the types of selection responses that are latent since we cannot observe both (D 1i ; D 0i ). We assume a randomized instrument,? (Y 1 ; Y 0 ; D 1 ; D 0 ). Then, the distribution of observables and the distribution of potential outcomes satisfy the following equalities for y 2 f1; 0g: Pr(Y = y; D = 1j = 1) = Pr(Y 1 = y; T = c) + Pr(Y 1 = y; T = at); (3.2) Pr(Y = y; D = 1j = 0) = Pr(Y 1 = y; T = d) + Pr(Y 1 = y; T = at); Pr(Y = y; D = 0j = 1) = Pr(Y 0 = y; T = d) + Pr(Y 1 = y; T = nt); Pr(Y = y; D = 0j = 0) = Pr(Y 0 = y; T = c) + Pr(Y 1 = y; T = nt): Ignoring the marginal distribution of, a full parameter vector of the model can be speci ed by a joint distribution of (Y 1 ; Y 0 ; T ) that consists of 16 probability masses: = Pr(Y 1 = y; Y 0 = y 0 ; T = t) : y = 1; 0; y 0 = 1; 0; t = c; nt; at; d : Let ATE be the parameter of interest. E(Y 1 Y 0 ) = X [Pr(Y 1 = 1; T = t) t=c;nt;at;d y=1;0 t=c;nt;at;d X X = [Pr(Y 1 = 1; Y 0 = y; T = t) h(): Pr(Y 0 = 1; T = t)] Pr(Y 1 = y; Y 0 = 1; T = t)] The likelihood conditional on depends on only through the distribution of (Y; D), so the su cient parameter vector consists of eight probability masses: = (Pr(Y = y; D = dj = z) : y = 1; 0; d = 1; 0; z = 1; 0) : The set of equations given in (3.2) characterizes the observationally equivalent set of distributions of (Y 1 ; Y 0 ; T ) ; () when the data distribution is given at. Balke and Pearl (1997) derive the identi ed set of ATE, H() = h( ()); by maximizing or minimizing h(), subject to being in the probability simplex and satisfying constraints (3.2). Since the objective function and the constraints are all linear, this optimization can be solved by linear programming and, consequently, H() is obtained as the convex intervals whose lower and upper bounds are the achieved minimum and maximum of the objective function in the linear optimization (see Balke and Pearl (1997) for a closed-form expression of the bounds and Kitagawa (2009) for an extension to general support of Y ). 10

11 Note that, in this model, special attention is needed for the su cient parameter space in order to ensure that the identi ed set of, (), is non-empty. Pearl (1995) shows that the distribution of data is compatible with the instrument exogeneity condition,? (Y 1 ; Y 0 ; D 1 ; D 0 ) ; if and only if max d X y max fpr(y = y; D = d)j = zg 1: (3.3) z This implies that in order to guarantee on the distributions of data that ful ll (3.3). () 6= ;, a prior distribution for must be supported only Example 3.2 (Linear Moment Inequality Model) Consider the model where the parameter of interest 2 H is characterized by linear moment inequalities, E(m(X) A) 0; where the parameter space H is a subset of R L, m(x) is a J-dimensional vector of known functions of an observation, and A is a J L known constant matrix. By augmenting the J-dimesional parameter 2 [0; 1) J, these moment inequalities can be written as the J-moment equalities 4 E(m(X) A ) = 0: We let the full parameter vector = (; ) 2 H [0; 1) J. To obtain a likelihood function for the moment equality model, we employ the exponentially tilted empirical likelihood for that is considered in the Bayesian context in Schennach (2005). Let x n be a size n random sample of observations, and let g() = A +. If the convex hull of [ i fm(x i ) g()g contains the origin, then the exponentially tilted empirical likelihood is written as p(x n j) = Y w i (); where w i () = exp f(g()) 0 (m(x i ) g())g P n i=1 exp f(g())0 (m(x i ) g())g ; (g()) = arg min 2R J + ( nx exp 0 (m(x i ) i=1 g()) ) : 4 The Bayesian formulation of the moment inequality model shown here is owed to Tony Lancaster, who suggested this to us in

12 Thus, the parameter = (; ) enters the likelihood only through g() = A+, so we take = g() as the su cient parameters. The identi ed set for is given by () = (; ) 2 H [0; 1) L : A + = : The coordinate projection of () onto H gives H(), the identi ed set for (see, e.g., Bertsimas and Tsitsiklis (1997, Chap.2) for an algorithm for projecting a polyhedron). 3.3 A Class of Unrevisable Prior Knowledge Let be a prior of and be the marginal probability measure on the su cient parameter space (; B) implied by and g(), i.e., (B) = ( (B)) for all B 2 B. Let x 2 X be sampled data, and assume that has a dominating measure. The posterior distribution of, denoted by F jx (), is then obtained as F jx (A) = j (Aj)dF jx (); A 2 A, (3.4) where j (Aj) denotes the conditional distribution of given, and F jx () is the posterior distribution of (Barankin (1960), Florens and Mouchart (1974), Picci (1977), Dawid (1979), Poirier (1998)). The posterior distribution of given in (3.4) shows that we are incapable of updating the conditional prior information of given because of the at likelihood on (). Accordingly, we can consider the prior information marginalized to the su cient parameter as the revisable prior knowledge, which we can potentially update by data, and the conditional priors of given, n o j (j) : 2, as the unrevisable prior knowledge, which we are never be able to update by data. If we want to obtain the posterior uncertainty of in the form of a probability distribution on (; A), as desired in the Bayesian paradigm, we need to have a single prior distribution of, which necessarily induces unique unrevisable prior knowledge j. If the researcher could justify his choice of j by any credible prior information, the standard Bayesian updating (3.4) would yield a valid posterior distribution of : A challenging situation would arise if he is short of a credible prior opinion on. In this case, the researcher, who is aware that j will never be updated by data, might feel anxious in implementing the Bayesian inference procedure that requires him to 12

13 specify unique j, because an uncon dently speci ed j can have a signi cant in uence to the subsequent posterior inference. The robust Bayes analysis in this paper particularly focuses on such situation. Let M be the set of probability measures on (; A), and be a prior on (; B) speci ed by the researcher. Consider the class of prior distributions of de ned by M( ) = 2 M : ( (B)) = (B) for every B 2 B : M( ) consists of prior distributions of whose marginal for the su cient parameters coincides with the prespeci ed. 5 That is, we accept specifying a single prior distribution for the su cient parameters, while we leave conditional priors j unspeci ed and allow for arbitrary ones as long as () = R j(j)d is a probability measure on (; A). 6 This prior class may be understood more intuitively if we consider a discrete su cient parameter space = f 1 ; : : : ; K g. Constraint ( ( k )) = ( k ) in the de nition of M( ) says that prior probability for su cient parameter; ( k ) for each k = 1; : : : ; K, speci es one s belief for f 2 ( k )g. On the other hand, conditional prior j (j k ) is interpreted as his prior belief for conditional on knowing that the lies in ( k ). Arbitrary probability distributions for conditional prior j (j k ) means any probability distributions of j (j k ) including a probability mass are allowed as long as it is supported only on ( k ), and no constraints imposed across any j (j k ) and j (j k 0), k 6= k 0. We can therefore express M( ) as ( K ) X M( ) = j (j k ) ( k ) : j (j k ) are prob. measures for supported only on ( k ). k=1 Prior class M( ) induces the prior class for = h(), which is represented by ( K ) X M ( ) = j (j k ) ( k ) : j (j k ) are prob. measures for supported only on H ( k ) : k=1 This representation of M ( ) o ers heuristic interpretation on the ambiguous belief of implied by M( ); the researcher has a single prior belief on which identi ed set H ( k ) is true in the population, while, he is lack of belief on where true likely to be conditional on knowing that lies 5 We thank Jean-Pierre Florens for suggesting this way of representing the prior class. 6 Su cient parameters are de ned by examining the entire model fp (xj) : x 2 X ; 2 g, so that prior class M( ) is by construction model dependent. This distinguishes our approach from the standard robust Bayes analysis where a prior class is intended to represent the researcher s subjective assessment of his imprecise prior knowledge (Berger (1985)). 13

14 in H ( k ) for every k = 1; : : : ; K. That is, the ambiguous prior knowedge for is interpreted as the n o K state of knowledge that one cannot assess which combination of conditional priors j (j k ) is more credible than other combinations, and he considers every possible combination of conditional n o K priors j (j k ) as plausible. The measurable selection argument in random set theory (see, k=1 e.g., Molchanov (2005, Chap.1, Sec2.2)), which is used also in the proof of Theorem 3.1 below, allows us to extend the interpretation of ambiguity stated here to a general space of. k=1 Some readers might wonder if it is sensible to apply some of the existing selection rules for non-informative priors for instead of introducing the class of priors. First of all, Je reys s general rule (Je reys (1961)), which takes the prior density to be proportional to the square root of the determinant of the Fisher information matrix, is not well de ned if the information matrix for is singular at almost every 2, which is the case in many partially identi ed model, e.g., the missing data example of Section 2. The empirical Bayes-type approach of choosing a prior for (Robbins (1964), Good (1965), Morris (1983), and Berger and Berliner (1986)) also breaks down if the model lacks identi cation because the marginal likelihood of data depends only on, p(xj)d = ^p(xj)d m(xj ): (3.5) It is also worth noting that we cannot obtain the reference prior of Bernardo (1979) because the conditional Kullback Leibler distance between the posterior density f jx () and a prior density d (): fjx () log d () ^p(xj) df jx () = log m(xj ) again depends only on for every realization of X. ^p(xj) m(xj ) d ; These prior selection rules are therefore useful for choosing a prior for the su cient parameters, and are of no use for selecting out of M( ). In the analysis given below, we shall not discuss how to select, and shall treat as given. The in uence of on the posterior of will diminish as the sample size increases, so the sensitivity issue of the posterior of is expected to be less severe when the sample size is moderate or large. 3.4 Posterior Lower and Upper Probabilities The prior class M( ) yields the class of posterior distributions of. We rst consider summarizing the posterior class by the posterior lower probability F jx () and the posterior upper probability 14

15 FjX (), which are de ned, for A 2 A, as F jx (A) inf 2M( ) F jx(a); FjX (A) sup F jx (A): 2M( ) Note that the posterior lower probability and the upper probability have a conjugate property, F jx (A) = 1 F jx (Ac ), so it su ces to focus on one of them in deriving their analytical form. For the lower and upper probabilities to be well de ned, we assume the following regularity conditions. Condition 3.1 (i) A prior for,, is a proper probability measure on (; B), and it is absolutely continuous with respect to a - nite measure on (; B) : (ii) g : (; A)! (; B) is measurable and its inverse image every 2. () is a closed set in, -almost (iii) h : (; A)! (H; D) is measurable and H () = h ( ()) is a closed set in H, -almost every 2 : These conditions are imposed in order for the identi ed sets () and H () to be interpreted as random closed sets in and H induced by a probability law on (; B). 7 The closedness of () and H () are implied, for instance, by continuity of g () and h (). Under these conditions, we can guarantee that for each A 2 A, f : () \ A = ;g and f : () Ag are supported by a probability measure on (; B) : Theorem 3.1 Assume Condition 3.1. (i) For each A 2 A, F jx (A) = F jx (f : () Ag); (3.6) FjX (A) = F jx (f : () \ A 6= ;g) ; (3.7) 7 The inference procedure proposed in this paper can be implemented as long as the posterior of is proper, while we do not know how to accommodate an improper prior for in our development of the analytical results. 15

16 where F jx (B) is the posterior probability measure of, F jx (B) = R B f jx()d for B 2 B. (ii) De ne the posterior lower and upper probabilities of = h () by, for each D 2 D, F jx (D) inf 2M( ) F jx(h 1 (D)); FjX (D) sup F jx (h 1 (D)): 2M( ) These are given by F jx (D) = F jx (f : H() Dg); F jx (D) = F jx(f : H() \ D 6= ;g): Proof. For a proof of (i), see Appendix A. For a proof of (ii), see equation (3.8) below. The expression for F jx (A) given above implies that the posterior lower probability on A calculates the probability that the set () is contained in subset A in terms of the posterior probability law of. On the other hand, the upper probability is interpreted as the posterior probability that the set () hits subset A. Since the probability law of random sets is uniquely characterized by the containment probability, as in (3.6), or the hitting probability, as in (3.7) (the Choquet Theorem), these results show that the posterior lower and upper probabilities for and with our speci cation of the prior class represents a unique probability law of the identi ed sets of and. The second statement of the theorem provides a procedure for transforming or marginalizing the lower and upper probabilities of into those of the parameter of interest. The expressions of F jx (D) and FjX (D) are simple and easy to interpret: the lower and upper probabilities of = h() are the containment and hitting probabilities of the random sets obtained by projecting () through h(). This marginalization rule of the lower probability follows from F jx (D) = F jx (h 1 (D)) = F jx ( : () h 1 (D) ) = F jx (f : H() Dg), (3.8) and the marginalization rule for the upper probability is obtained similarly. It is worth noting that the operation of marginalizing the posterior information of to is very di erent between the lower probability inference and the standard Bayesian inference. In 16

17 the standard Bayesian inference with a single posterior, marginalization of the posterior of to is done by integrating the posterior probability measure of for, while in the lower probability inference, marginalization for corresponds to obtaining the probability law of H () the projected random set of () by = h (). As is well known in the literature, the lower and upper probabilities of a set of probability measures are in general a monotone and nonadditive measure (capacity). In our speci cation of prior class, the closed form expressions of Theorem 3.1 assures that the resulting posterior lower and upper probabilities are supermodular and submodular, respectively. Corollary 3.1 Assume Condition 3.1. The posterior lower and upper probabilities of are supermodular and submodular, respectively. For A 1, A 2 2 A subsets in, F jx (A 1 [ A 2 ) + F jx (A 1 \ A 2 ) F jx (A 1 ) + F jx (A 2 ); FjX (A 1 [ A 2 ) + FjX (A 1 \ A 2 ) FjX (A 1) + FjX (A 2): Also, the posterior lower and upper probabilities of are supermodular and submodular, respectively. For D 1, D 2 2 D subsets in H, F jx (D 1 [ D 2 ) + F jx (D 1 \ D 2 ) F jx (D 1 ) + F jx (D 2 ); FjX (D 1 [ D 2 ) + FjX (D 1 \ D 2 ) FjX (D 1) + FjX (D 2): The result of Theorem 3.1 (i) can be seen as a special case of Wasserman s (1990) general construction of the posterior lower and upper probabilities. An important di erence from Wasserman (1990) is that with our way of specifying prior class M, we can guarantee that the lower probability of the posterior class is an 1-order monotone capacity (a containment functional of random sets). 8 Submodularity property for the posterior upper probability implied from this result is crucial for simplifying the gamma minimax analysis of Section Wasserman (1990, p.463) posed an open question asking which class of priors can assure the posterior lower probability to be a Shafer s belief function (equivalently, a containment functional of random sets). Theorem 3.1 gives an answer to his open question in the situation where the model lacks identi ability. 9 To the best of my knoweldge, it is not yet known if submodularity of the posterior upper probability generally holds or not in the general setup of Wasserman (1990). prior class, we obtain a submodular posterior upper probability. Corollary 3.1 demonstrates that, with our speci cation of 17

18 4 Point Decision of = h() with Multiple Priors In the standard Bayes posterior analysis, the mean or median of a posterior distribution of is often reported as a point estimate of. If we summarize the posterior information of in terms of its posterior lower and upper probabilities instead of a posterior distribution, how should we construct a reasonable point estimator for? We consider a setting where a statistical decision is made after observing data. be an action, where H a is an action space. Let a 2 H a We interpret an action as reporting a non-randomized point estimate for, so that we assume action space H a is a subset of H. Given an action a to be taken, and 0 being the true state of nature, a loss function L( 0 ; a) : H H a! R + yields the cost to the decision maker of taking such an action. function L(; a) is non-negative and D-measurable for each a 2 H a. We assume loss Given a single prior for,, the posterior risk is de ned by ( ; a) L(; a)df jx (); (4.1) H where the rst argument in the posterior risk represents the dependence of the posterior of on the speci cation of the prior for. Our analysis involves multiple priors, so the class of posterior risks ( ; a) : 2 M( ) is considered. The posterior gamma-minimax criterion 10 ranks actions in terms of the worst case posterior risk (upper posterior risk): ( ; a) sup ( ; a), 2M( ) where the rst argument in the upper posterior risk represents the dependence of the prior class on a prior for. De nition 4.1 A posterior gamma-minimax action a x with respect to prior class M is an action that minimizes the upper posterior risk, i.e., ( ; a x) = inf a2h a ( ; a) = inf sup a2h a 2M( ) ( ; a). 10 In the robust Bayes literature, the class of prior distributions is often denoted by, and this is why it is called the (conditional) gamma-minimax criterion. Unfortunately, in the literature of belief functions and lower and upper probabilities, often denotes a set-valued mapping that generates the lower and upper probabilities. In this paper, we adopt the latter notational convention, but still refer to the decision criterion as the gamma-minimax criterion. 18

19 The gamma-minimax decision approach involves a favor for a conservative action that guards against the least favorable prior within the class, and it can be seen as a compromise of the Bayesian decision principle and the minimax decision principle. The next proposition shows that minimizing the upper posterior risk ( ; a) is equivalent to minimizing the Choquet expected loss with respect to the posterior upper probability. Proposition 4.1 Under Condition 3.1, the upper posterior risk satis es ( ; a) = L(; a)dfjx () = sup L(; a)f jx (jx)d, (4.2) 2H() whenever R L(; a)dfjx () < 1. Accordingly, a gamma-minimax decision a x with respect to prior class M exists if and only if EjX sup 2H() L(; a) has a minimizer in a 2 H a. Proof. See Appendix A. The third expression in (4.2) shows that the posterior gamma-minimax criterion is written as the expectation of the worst-case loss function, sup 2H() L(; a), with respect to the posterior of. The supremum part stems from the ambiguity of : given, what the researcher knows about is only that it lies within the identi ed set H () ; so the researcher forms the loss by supposing that the nature chooses the worst case in response to his action a. On the other hand, the expectation in represents the posterior uncertainty of the identi ed set H (): with the nite number of observations, the identi ed set of is a random set induced by the posterior of. The gammaminimax criterion with the class of priors M combines such ambiguity of with the posterior uncertainty of the identi ed set H () to yield a single objective function to be minimized. Although a closed-form expression of a x is not in general available, this proposition suggests a simple numerical algorithm for approximating a x using a random sample of from its posterior f jx (jx). Let f s g S s=1 be S random draws of from posterior f jx(). We shall approximate a x by ^a 1 x arg min a2h a S SX sup s=1 2H( s ) L(; a). The reader may wonder if the posterior gamma minimax action a x can be interpreted as a Bayes action for some posterior distributions in the class. In case of the quadratic loss, the saddle-point 19

20 argument as in Grünwald and Dawid (2003) shows that the gamma-minimax action a x corresponds to the mean of a posterior distribution (Bayes action) that has maximal variance in the class. Such maximal variance posterior for may be obtained by allocating the posterior probability at each to some boundary points of H (). 11 In the presence of multiple priors, it is known that a posteriori optimal gamma-minimax action as considered here can di er from an a priori optimal gamma-minimax decision (dynamic inconsistency problem, Vidakovic (2000)). Interestingly, with our speci cation of prior class, such dynamic inconsistency problem does not arise, so the gamma-minimax action obtained above is supported as an optimal gamma-minimax decision in the unconditional sense. For details and a proof of this claim, see Appendix B. As an alternative to the (posterior) gamma-minimax criterion considered above, it is possible to consider the gamma-minimax regret criterion (Berger (1985, p. 218), and Rios Insua, Ruggeri, and Vidakovic (1995)), which is seen as an extension of the minimax regret criterion of Savage (1951) to the multiple-prior Bayes context. In Appendix B, we provide some analytical results of the gamma-minimax regret analysis where the parameter of interest is a scalar and the loss function is quadratic, L(; a) = ( a) 2. We demonstrate that the gamma-minimax regret decision can di er from the gamma-minimax decision derived above, but that they coincide asymptotically. 5 Set Estimation of This section discusses how to use the posterior lower probability of to conduct set estimation of. In the standard Bayesian inference, set estimation is often done by reporting contour sets of the posterior probability density of (highest posterior density region). If the posterior information for is summarized by the lower and upper probabilities, how should we conduct set estimation of? 11 A common criticism to the gamma-minimax analysis is that a posterior that yields the gamma minimax action as a Bayes action does not appear appealing. If this criticism applies to our context, it would make sense to constrain the prior class, or to seek for a more appealing decision criterion. Further discussion related to this point is provided in Section

21 5.1 Posterior Lower Credible Region For 2 (0; 1), consider a subset C 1 H such that the posterior lower probability F jx (C 1 ) is greater than or equal to 1 : F jx (C 1 ) = F jx (H() C 1 )) 1. (5.1) In words, C 1 is interpreted as a set on which the posterior credibility of is at least 1, no matter which posterior is chosen within the class. If we drop the italicized part from this statement, we obtain the usual interpretation of the posterior credible region, so C 1 de ned in this way seems to be a natural way to extend the Bayesian posterior credible regions to those of the posterior lower probability. Analogous to the Bayesian posterior credible region, there are also multiple C 1 s that satisfy (5.1). For instance, given a posterior credibility region of with credibility 1, B 1, C 1 = [ 2B1 H () suggested in an earlier version of Moon and Schorfheide (2011) satis es (5.1). In our proposal of set inference for, we restrict multiplicity of set estimates as well as penalize their volume by focusing on a smallest one among C 1 s satisfying (5.1), C 1 arg min Leb(C) (5.2) C2C s.t. F jx (H() C)) 1 ; where Leb(C) is the volume of subset C in terms of the Lebesgue measure and C is a family of subsets in H over which the volume-minimizing lower credible region is searched. Analogous to the highest posterior density region in the standard Bayesian inference, focusing on the smallest set estimate has a decision theoretic justi cation in terms of gamma-minimax criterion. 12 C 1 as a posterior lower credible region with credibility 1. We refer to 12 Focusing on smallest C 1 has a decision theoretic justi cation. We can show that under mild conditions C 1 can be supported as a gamma minimax action (if it exists), 2 3 C 1 = arg min 4 sup L (; C) df jx 5 C2C 2M( ) with loss function, L (; C) = Leb (C) + b () [1 1 C ()] ; where b () is a positive constant that depends on credibility level 1 terms of parameter ; so the object of interest is rather than identi ed set H ().. Note that the loss function is written in 21

22 Finding C 1 is challenging if is multi-dimensional and no restriction is placed on class of subsets C. In what follows, we restrict our analysis to scalar and constrain C to the class of closed connected intervals. Let C r ( c ) be a closed interval in H, centered at c 2 H, with width 2 r: Accordingly, the constrained minimization problem of (5.2) is reduced to min r r; c s.t. F jx (H() C r ( c )) 1 : (5.3) The next proposition provides an alternative way to formulate this optimization problem, which will be useful for obtaining C 1 numerically. Proposition 5.1 Let d : H D! R + measures distance from c 2 H to set H () in terms of d ( c ; H()) sup fk c kg. 2H() For each c 2 H, let r 1 ( c ) be the (1 )-th quantile of the distribution of d ( c ; H()) induced by the posterior distribution of, i.e., r 1 ( c ) inf r : F jx (f : d( c ; H()) rg) 1. Then, the solution of the constrained minimization problem (5.3) is given by (r 1 ; c), where c = arg min c 2H r 1 ( c ) and r 1 = r 1 ( c). Proof. See Appendix A. Let f s : s = 1; : : : ; Sg be random draws of from its posterior. At each c 2 H, we rst calculate ^r 1 ( c ), the empirical (1 )-th quantile of d( c ; H()), based on the simulated d( c ; H( s )), s = 1; : : : ; S. Provided that r 1 ( c ) is continuous in c, the obtained empirical (1 )-th quantile, ^r 1 ( c ), is a valid approximation of the underlying (1 )-th quantile r 1 ( c ), so that we can approximate (r1 ; c) by searching a minimizer and the minimized value of ^r 1 ( c ). 5.2 Asymptotic Property of the Posterior Lower Credible Region This section examines the large-sample behavior of the posterior lower probability in an intervally identi ed case, i.e., H () R is a closed bounded interval for almost all 2. From now on, we 22

23 make the sample size explicit in our notation: a size n sample X n is generated from its sampling distribution P X n j 0, where 0 denotes the value of the su cient parameters that corresponds to the true data-generating process. We denote the maximum likelihood estimator for by ^. Provided that the posterior of is consistent 13 to 0 and set-valued map H ( 0 ) is continuous at = 0 in terms of the Hausdor metric d, it is straightforward to show that random sets H () represented by the posterior lower probability F jx n () is consistent to true identi ed set H ( 0 ) in the sense of lim n!1 F jx n (f : d (H () ; H ( 0 )) > g) = 0 for almost every sampling sequence of fx n g. Our main focus in what follows is therefore on the asymptotic coverage property of the lower credible region C 1 constructed in the previous section. Our main analytical result rely on the following set of regularity conditions. Condition 5.1 (i) The parameter of interest is a scalar and the identi ed set H() is -almost surely a nonempty and connected interval, i.e., H () = [ l (); u ()], 1 l () u () 1. The true identi ed set H ( 0 ) = [ l ( 0 ); u ( 0 )] is a bounded interval. (ii) For sequence a n! 1, de ne random variables L n () = a n l () l (^) and U n () = a n u () u (^), whose distribution is induced by the posterior distribution of given sample X n. The cumulative distribution function of (L n () ; U n ()) given X n denoted by J n () is continuous almost surely under ^p (x n j 0 ) for all n. (iii) There exists bivariate random variables (L; U), whose cumulative distribution function J () on R 2 is continuous, and at each c (c l ; c u ) 2 R 2, the cumulative distribution function of (L n () ; U n ()) given X n, J n (c) ; converges in probability under P X n j 0 to J (c). (iv) The sampling distribution of converges in distribution to (L; U) : a n l ( 0 ) l (^) ; a n u ( 0 ) u (^) Conditions 5.1 (iii) and (iv) state that the estimator for the lower and upper bounds of H () satisfy the Bernstein von Mises asymptotic equivalence between the Bayesian estimation and the 13 What we mean for posterior consistency of is that lim n!1 F jx n (G) = 1 for every G open neighborhood of 0 for almost every sampling sequence. In case of nite dimensional, this posterior consistency for is implied by a set of higher-level conditions for the likelihood of. We do not list up all those conditions here for the sake of brevity. See Section 7.4 of Schervish (1995) for details. 23

Estimation and Inference for Set-identi ed Parameters Using Posterior Lower Probability

Estimation and Inference for Set-identi ed Parameters Using Posterior Lower Probability Toru Kitagawa Department of Economics, University College London 28, July, 2012 Abstract In inference for set-identi