arxiv: v1 [stat.me] 4 Sep 2013


A BAYESIAN INFORMATION CRITERION FOR SINGULAR MODELS

MATHIAS DRTON AND MARTYN PLUMMER

Abstract. We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion (BIC), and the penalty structure in BIC generally does not reflect the frequentist large-sample behavior of their marginal likelihood. While large-sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components of mixture models, the number of factors in latent factor models or the rank in reduced-rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.

1. Introduction

Information criteria are classical tools for model selection. At a high level, they fall into two categories (Yang, 2005). On the one hand, there are criteria that target good predictive behavior of the selected model; the information criterion of Akaike (1974) and cross-validation based scores are examples. The Bayesian information criterion (BIC) of Schwarz (1978), on the other hand, draws motivation from Bayesian approaches. From the frequentist perspective, it has been shown in a number of settings that the BIC is consistent. In other words, under optimization of BIC the probability of selecting a fixed most parsimonious true model tends to one as the sample size tends to infinity (e.g., Nishii, 1984; Haughton, 1988, 1989).
From a Bayesian point of view, the BIC yields rather crude but computationally inexpensive approximations to otherwise difficult to calculate posterior model probabilities in Bayesian model selection/averaging; see Kass and Wasserman (1995), Raftery (1995), DiCiccio et al. (1997) or Hastie et al. (2009, Chap. 7.7).

In this paper, we are concerned with Bayesian information criteria in the context of singular model selection problems, that is, problems that involve models with Fisher-information matrices that may fail to be invertible. For example, due to the break-down of parameter identifiability, the Fisher-information matrix of a mixture model with three component distributions is singular at a distribution that can be obtained by mixing only two components. This clearly presents a fundamental challenge for selection of the number of components. Other important examples of this type include determining the rank in reduced-rank regression, the number of factors in factor analysis or the number of states in latent class or hidden Markov models. More generally, all the classical hidden/latent variable models are singular.

Key words and phrases. Bayesian information criterion, factor analysis, mixture model, model selection, reduced-rank regression, singular learning theory, Schwarz information criterion.

As demonstrated by Steele and Raftery (2010) for Gaussian mixture models or Lopes and West (2004) for factor analysis, BIC can be a state-of-the-art method for singular model selection. However, while BIC is known to be consistent in these and other singular settings (Keribin, 2000; Drton et al., 2009, Chap. 5.1), the technical arguments in its Bayesian-inspired derivation do not apply. In a nutshell, when the Fisher-information is singular, the log-likelihood function does not admit a large-sample approximation by a quadratic form. Consequently, the BIC does not reflect the frequentist large-sample behavior of the Bayesian marginal likelihood of singular models (Watanabe, 2009). In contrast, this paper develops a generalization of BIC that is not only consistent but also maintains a rigorous connection to Bayesian model choice in singular settings. The generalization is honest in the sense that the new criterion coincides with Schwarz's when the model is regular.

The new criterion, which we abbreviate to sBIC, is presented in Section 3. It relies on theoretical knowledge about the large-sample behavior of the marginal likelihood of the considered models. Section 2 reviews the necessary background on this theory as developed by Watanabe (2009). Consistency of sBIC is shown in Section 4, and the connection to Bayesian methods is developed in Section 5. In the numerical examples in Section 6, sBIC achieves improved statistical inferences while keeping computational cost low. Concluding remarks are given in Section 7.

2. Background

Let $Y_n = (Y_{n1}, \dots, Y_{nn})$ denote a sample of $n$ independent and identically distributed observations, and let $\{M_i : i \in I\}$ be a finite set of candidate models for the distribution of these observations.
For a Bayesian treatment, suppose that we have positive prior probabilities $P(M_i)$ for the models and that, in each model $M_i$, a prior distribution $P(\pi_i \mid M_i)$ is specified for the probability distributions $\pi_i \in M_i$. Write $P(Y_n \mid \pi_i, M_i)$ for the likelihood of $Y_n$ under data-generating distribution $\pi_i$ from model $M_i$. Let

$$L(M_i) := P(Y_n \mid M_i) = \int_{M_i} P(Y_n \mid \pi_i, M_i)\, dP(\pi_i \mid M_i) \tag{2.1}$$

be the marginal likelihood of model $M_i$. Bayesian model choice is then based on the posterior model probabilities

$$P(M_i \mid Y_n) \propto P(M_i)\, L(M_i), \qquad i \in I.$$

The probabilities $P(M_i \mid Y_n)$ can be approximated by various Monte Carlo procedures (see Friel and Wyse (2012) for a recent review), but practitioners also often turn to computationally inexpensive proxies suggested by large-sample theory. These proxies are based on the asymptotic properties of the sequence of random variables $L(M_i)$ obtained when $Y_n$ is drawn from a data-generating distribution $\pi_0 \in M_i$ and the sample size $n$ grows.

In practice, a prior distribution $P(\pi_i \mid M_i)$ is typically specified by parametrizing $M_i$ and placing a distribution on the involved parameters. So assume that

$$M_i = \{ \pi_i(\omega_i) : \omega_i \in \Omega_i \} \tag{2.2}$$

with $d_i$-dimensional parameter space $\Omega_i \subseteq \mathbb{R}^{d_i}$, and that $P(\pi_i \mid M_i)$ is the transformation of a distribution $P(\omega_i \mid M_i)$ on $\Omega_i$ under the map $\omega_i \mapsto \pi_i(\omega_i)$. The marginal likelihood then becomes the $d_i$-dimensional integral

$$L(M_i) = \int_{\Omega_i} P(Y_n \mid \pi_i(\omega_i), M_i)\, dP(\omega_i \mid M_i). \tag{2.3}$$

The observation of Schwarz and other subsequent work is that, under suitable technical conditions on the model $M_i$, the parametrization $\omega_i \mapsto \pi_i(\omega_i)$ and the prior distribution $P(\omega_i \mid M_i)$, it holds for all $\pi_0 \in M_i$ that

$$\log L(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{d_i}{2} \log(n) + O_p(1). \tag{2.4}$$

Here, $P(Y_n \mid \hat\pi_i, M_i)$ is the maximum of the likelihood function, and $O_p(1)$ stands for a remainder that is bounded in probability, i.e., uniformly tight as the sample size $n$ grows. The first two terms on the right-hand side of (2.4) are functions of the data $Y_n$ and the model $M_i$ alone and may thus be used as a model score or proxy for the logarithm of the marginal likelihood.

Definition 2.1. The Bayesian or Schwarz's information criterion for model $M_i$ is

$$\mathrm{BIC}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{d_i}{2} \log(n).$$

Briefly put, the large-sample behavior from (2.4) relies on the following properties of regular problems. First, with high probability, the integrand in (2.3) is negligibly small outside a small neighborhood of the maximum likelihood estimator of $\omega_i$. Second, in such a neighborhood, the log-likelihood function $\log P(Y_n \mid \pi_i(\omega_i), M_i)$ can be approximated by a negative definite quadratic form, while a smooth prior $P(\omega_i \mid M_i)$ is approximately constant. The integral in (2.3) may thus be approximated by a Gaussian integral, whose normalizing constant leads to (2.4). We remark that this approach also allows for estimation of the remainder term in (2.4), giving a Laplace approximation with error $O_p(n^{-1/2})$; compare, e.g., Tierney and Kadane (1986), Haughton (1988), Kass and Wasserman (1995), Wasserman (2000).

A large-sample quadratic approximation to the log-likelihood function is not possible, however, when the Fisher-information matrix is singular.
Consequently, the classical theory alluded to above does not apply to singular models. Indeed, (2.4) is generally false in singular models. Nevertheless, asymptotic theory for the marginal likelihood of singular models has been developed over the last decade, culminating in the monograph of Watanabe (2009). Theorem 6.7 in Watanabe (2009) shows that a wide variety of singular models have the property that, for $Y_n$ drawn from $\pi_0 \in M_i$,

$$\log L(M_i) = \log P(Y_n \mid \pi_0, M_i) - \lambda_i(\pi_0) \log(n) + \big[ m_i(\pi_0) - 1 \big] \log\log(n) + O_p(1); \tag{2.5}$$

see also the introduction to the topic in Drton et al. (2009, Chap. 5.1). If the sequence of likelihood ratios $P(Y_n \mid \hat\pi_i, M_i)/P(Y_n \mid \pi_0, M_i)$ is bounded in probability, then we also have that

$$\log L(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \lambda_i(\pi_0) \log(n) + \big[ m_i(\pi_0) - 1 \big] \log\log(n) + O_p(1). \tag{2.6}$$
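To make the difference between the penalty terms in (2.4) and (2.6) concrete, the following Python sketch evaluates both. The function names and the numerical values used in the loop (a model with $d_i = 15$ parameters, learning coefficient $9/2$ and multiplicity $2$) are illustrative assumptions and are not tied to any data set.

```python
import math


def bic_penalty(d, n):
    """Penalty (d/2) * log(n) of Definition 2.1 for a model with d parameters."""
    return 0.5 * d * math.log(n)


def singular_penalty(lam, m, n):
    """Penalty lam * log(n) - (m - 1) * log(log(n)) read off from (2.6).

    lam is the learning coefficient and m its multiplicity; both are
    supplied by the caller and are assumed known here.
    """
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))


# Hypothetical values for illustration only.
for n in (50, 200, 1000):
    print(n, round(bic_penalty(15, n), 2), round(singular_penalty(4.5, 2, n), 2))
```

With $\lambda = d/2$ and $m = 1$ the two functions agree, which is the regular case; for smaller $\lambda$ the singular penalty grows more slowly in $n$.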

For singular submodels of exponential families such as the reduced-rank regression and factor analysis models treated later, the likelihood ratios converge in distribution and are thus bounded in probability (Drton, 2009). For more complicated models, such as mixture models, likelihood ratios can often be shown to converge in distribution under compactness assumptions on the parameter space; compare, e.g., Azaïs et al. (2006, 2009). Such compactness assumptions also appear in the derivation of (2.5). We will not concern ourselves further with the details of these issues as the main purpose of this paper is to describe a statistical method that can leverage mathematical information in the form of equation (2.6).

The quantity $\lambda_i(\pi_0)$ is known as the learning coefficient (or also real log-canonical threshold or stochastic complexity) and $m_i(\pi_0)$ is its multiplicity. In the analytic settings considered in Watanabe (2009), it holds that $\lambda_i(\pi_0)$ is a rational number in $[0, d_i/2]$ and $m_i(\pi_0)$ is an integer in $\{1, \dots, d_i\}$. We remark that in singular models it is very difficult to estimate the $O_p(1)$ remainder term in (2.6). We are not aware of any successful work on higher-order approximations in statistically relevant settings.

Example 2.1. Reduced-rank regression is multivariate linear regression subject to a rank constraint on the matrix of regression coefficients (Reinsel and Velu, 1998). Keeping only the most essential structure, suppose we observe $n$ independent copies of a partitioned zero-mean Gaussian random vector $Y = (Y_1, Y_2)$, with $Y_1 \in \mathbb{R}^N$ and $Y_2 \in \mathbb{R}^M$, and where the covariance matrix of $Y_2$ and the conditional covariance matrix of $Y_1$ given $Y_2$ are both the identity matrix. The reduced-rank regression model $M_i$ associated to an integer $i \ge 0$ postulates that the $N \times M$ matrix $\pi$ in the conditional expectation $E[Y_1 \mid Y_2] = \pi Y_2$ has rank at most $i$.
In a Bayesian treatment, consider the parametrization $\pi = \omega_2 \omega_1$, with absolutely continuous prior distributions for $\omega_2 \in \mathbb{R}^{N \times i}$ and $\omega_1 \in \mathbb{R}^{i \times M}$. Let the true data-generating distribution be given by the matrix $\pi_0$ of rank $j \le i$. Aoyagi and Watanabe (2005) derived the learning coefficients $\lambda_i(\pi_0)$ and their multiplicities $m_i(\pi_0)$ for this setup. In particular, $\lambda_i(\pi_0)$ and $m_i(\pi_0)$ depend on $\pi_0$ only through the true rank $j$. For a concrete instance, take $N = 5$ and $M = 3$. Then the multiplicity $m_i(\pi_0) = 1$ unless $i = 3$ and $j = 0$, in which case $m_i(\pi_0) = 2$. The values of $\lambda_i(\pi_0)$ are:

$$\begin{array}{c|cccc}
 & j = 0 & j = 1 & j = 2 & j = 3 \\ \hline
i = 0 & 0 & & & \\
i = 1 & 3/2 & 7/2 & & \\
i = 2 & 3 & 9/2 & 6 & \\
i = 3 & 9/2 & 11/2 & 13/2 & 15/2
\end{array}$$

Note that the table entries for $j = i$ are equal to $\dim(M_i)/2$, where $\dim(M_i) = i(N + M - i)$ is the dimension of $M_i$, which can be identified with the set of $N \times M$ matrices of rank at most $i$. The dimension is also the maximal rank of the Jacobian of the map $(\omega_1, \omega_2) \mapsto \omega_2 \omega_1$. The singularities of $M_i$ correspond to the points where the Jacobian fails to have maximal rank. These have $\mathrm{rank}(\omega_2 \omega_1) < i$. The fact that the singularities correspond to a drop in rank presents a challenge for model selection, which here amounts to selection of an appropriate rank. Simulation studies on rank selection have shown that the standard BIC, with $d_i = \dim(M_i)$ in Definition 2.1, has a tendency to select overly small ranks; for a recent example see Cheng and Phillips (2012). The quoted values of $\lambda_i(\pi_0)$ give a theoretical explanation, as the use of dimension in BIC leads to overpenalization of models that contain the true data-generating distribution but are not minimal in that regard.

Determining learning coefficients can be a challenging problem, but progress has been made. For some of the examples that have been treated, we refer the reader to Aoyagi (2010a,b, 2009), Watanabe and Amari (2003), Watanabe and Watanabe (2007), Rusakov and Geiger (2005), Yamazaki and Watanabe (2003, 2005, 2004), and Zwiernik (2011). The use of techniques from computational algebra and combinatorics is emphasized in Lin (2011); see also Arnol'd et al. (1988), Vasil'ev (1979).

The mentioned theoretical progress, however, does not readily translate into practical statistical methodology because one faces the obstacle that the learning coefficients depend on the unknown data-generating distribution $\pi_0$, as indicated in our notation in (2.6). For instance, for the problem of selecting the rank in reduced-rank regression (Example 2.1), the Bayesian measure of model complexity that is given by the learning coefficient and its multiplicity depends on the rank we wish to determine in the first place. It is for this reason that there is currently no statistical method that takes advantage of theoretical knowledge about learning coefficients. In the remainder of this paper, we propose a solution for how to overcome the problem of circular reasoning and give a practical extension of the Bayesian information criterion to singular models.

3. New Bayesian information criterion for singular models

If the true data-generating distribution $\pi_0$ were known, then (2.6) would suggest replacing the marginal likelihood $L(M_i)$ by

$$L'_{\pi_0}(M_i) := P(Y_n \mid \hat\pi_i, M_i) \cdot n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1}. \tag{3.1}$$
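Evaluating (3.1) in the reduced-rank regression setting of Example 2.1 requires $\lambda_i(\pi_0)$ and $m_i(\pi_0)$ as functions of the model rank $i$ and the true rank $j$. The Python sketch below is a hedged illustration: the case distinction in `rrr_learning_coefficient` is a transcription of the closed form attributed to Aoyagi and Watanabe (2005), which is not stated in this paper and should be checked against that reference. It does reproduce the two facts given in Example 2.1, namely the values $\dim(M_i)/2$ for $j = i$ and the multiplicity pattern for $N = 5$, $M = 3$. All function names are hypothetical.

```python
from fractions import Fraction
import math


def rrr_learning_coefficient(N, M, i, j):
    """Return (lambda, multiplicity) for model rank i and true rank j.

    Assumed closed form (to be verified against Aoyagi and Watanabe, 2005),
    for an N x M coefficient matrix with 0 <= j <= i <= min(N, M).
    """
    H, r = i, j
    if not 0 <= r <= H <= min(N, M):
        raise ValueError("need 0 <= j <= i <= min(N, M)")
    if N + r <= M + H and M + r <= N + H and H + r <= M + N:
        num = 2 * (H + r) * (M + N) - (M - N) ** 2 - (H + r) ** 2
        if (M + H + N + r) % 2 == 0:
            return Fraction(num, 8), 1
        return Fraction(num + 1, 8), 2
    if M + H < N + r:
        return Fraction(H * M - H * r + N * r, 2), 1
    if N + H < M + r:
        return Fraction(H * N - H * r + M * r, 2), 1
    return Fraction(M * N, 2), 1


def log_L_prime(loglik_hat, lam, m, n):
    """Logarithm of (3.1), given the maximized log-likelihood loglik_hat."""
    return loglik_hat - float(lam) * math.log(n) + (m - 1) * math.log(math.log(n))


# Learning coefficients and multiplicities for N = 5, M = 3.
for i in range(4):
    print(i, [rrr_learning_coefficient(5, 3, i, j) for j in range(i + 1)])
```

Exact fractions are used so that the comparison with $\dim(M_i)/2 = i(N + M - i)/2$ involves no rounding.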
The data-generating distribution being unknown, however, we propose to follow the standard Bayesian approach and to assign a probability distribution $Q_i$ to the distributions in model $M_i$. We then eliminate the unknown distribution $\pi_0$ by marginalization. In other words, we compute an approximation to $L(M_i)$ as

$$L'_{Q_i}(M_i) := \int_{M_i} L'_{\pi_0}(M_i)\, dQ_i(\pi_0). \tag{3.2}$$

The crux of the matter now becomes choosing an appropriate measure $Q_i$. Before discussing particular choices for $Q_i$, we stress that any choice for $Q_i$ reduces to Schwarz's criterion in the regular case.

Proposition 3.1. If the model $M_i$ is regular, then it holds for all probability measures $Q_i$ on $M_i$ that $L'_{Q_i}(M_i) = e^{\mathrm{BIC}(M_i)}$.

Proof. In our context, a regular model with $d_i$ parameters satisfies $\lambda_i(\pi_0) = d_i/2$ and $m_i(\pi_0) = 1$ for all data-generating distributions $\pi_0 \in M_i$. Hence, the integrand in (3.2) is constant and equal to $L'_{\pi_0}(M_i) = e^{\mathrm{BIC}(M_i)}$.

Returning to the singular case, one possible candidate for $Q_i$ is the posterior distribution $P(\pi_0 \mid M_i, Y_n)$. Under this distribution, however, the singular models encountered in practice have the learning coefficient $\lambda_i(\pi_0)$ almost surely equal to $\dim(M_i)/2$ with multiplicity $m_i(\pi_0) = 1$; recall Example 2.1.[1] We obtain that

$$\log L'_{Q_i}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{\dim(M_i)}{2} \log(n),$$

which is the usual BIC, albeit with the possibility that $\dim(M_i) < d_i$, where $d_i$ is the dimension of the parameter space $\Omega_i$ when $M_i$ is presented as in (2.2). From a pragmatic point of view, this choice of $Q_i$ is not attractive as it merely recovers the adjustment from $d_i$ to $\dim(M_i)$ that is standard practice when applying Schwarz's BIC to singular models. More importantly, however, averaging with respect to the posterior distribution $P(\pi_0 \mid M_i, Y_n)$ involves conditioning on the single model $M_i$, which clearly ignores the uncertainty regarding the choice of model that is inherent in model selection problems.

In most practical problems, the finite set of models $\{M_i : i \in I\}$ has interesting structure with respect to the partial order given by inclusion. For notational convenience, we define the poset structure on the index set $I$ and write $i \preceq j$ when $M_i \subseteq M_j$. Instead of conditioning on a single model, we then advocate the use of the posterior distribution

$$Q_i(\pi_0) := P\big(\pi_0 \mid \{M : M \subseteq M_i\}, Y_n\big) = \frac{\sum_{j \preceq i} P(\pi_0 \mid M_j, Y_n)\, P(M_j \mid Y_n)}{\sum_{j \preceq i} P(M_j \mid Y_n)} \tag{3.3}$$

obtained by conditioning on the family of all submodels of $M_i$. Intuitively, the proposed choice of $Q_i$ introduces the knowledge that the data-generating distribution $\pi_0$ is in $M_i$ while capturing the remaining posterior uncertainty with respect to submodels of $M_i$. This does not completely escape from the problem of circular reasoning, since (3.3) involves the posterior probabilities $P(M_j \mid Y_n)$ that we are trying to approximate. But as we argue below, this problem can be overcome.
Note that the choice in (3.3) avoids asymptotics when $\pi_0$ is not in $M_i$.

In the examples motivating our work, there is an interplay between the behavior of the learning coefficients (and their multiplicities) and the submodels contained in the considered model. Suppose $\pi_0$ is a random probability measure in $M_j \subseteq M_i$, distributed according to $P(\pi_0 \mid M_j, Y_n)$. Then it typically holds that both $\lambda_i(\pi_0)$ and $m_i(\pi_0)$ are almost surely constant; recall again Example 2.1. For $j \preceq i$, let $\lambda_{ij}$ and $m_{ij}$ denote these constants and define

$$L'_{ij} := P(Y_n \mid \hat\pi_i, M_i) \cdot n^{-\lambda_{ij}} (\log n)^{m_{ij} - 1} > 0, \tag{3.4}$$

which can be evaluated in statistical practice. Let $L'(M_i) := L'_{Q_i}(M_i)$ when $Q_i$ is chosen as in (3.3). With $P(M_j \mid Y_n) \propto L(M_j) P(M_j)$, we obtain from (3.2) that

$$L'(M_i) = \frac{1}{\sum_{j \preceq i} L(M_j) P(M_j)} \sum_{j \preceq i} L'_{ij}\, L(M_j) P(M_j). \tag{3.5}$$

Replacing $L(M_j)$ by $L'(M_j)$ in (3.5) yields the equation system

$$L'(M_i) = \frac{1}{\sum_{j \preceq i} L'(M_j) P(M_j)} \sum_{j \preceq i} L'_{ij}\, L'(M_j) P(M_j), \qquad i \in I, \tag{3.6}$$

with the unknowns being the desired marginal likelihood approximations $L'(M_i)$. Clearing the denominator, we obtain the equation system

$$\sum_{j \preceq i} \big[ L'(M_i) - L'_{ij} \big]\, L'(M_j) P(M_j) = 0, \qquad i \in I. \tag{3.7}$$

[1] For a definition of model dimension, we assume that the set $M_i$ corresponds to a subset of Euclidean space.

Proposition 3.2. The equation system in (3.7) has a unique solution with all unknowns $L'(M_i) > 0$.

Proof. Suppose $i$ is a minimal element of the poset $I$. Then $j = i$ is the only choice for the index $j$, and the equation from (3.7) reads

$$\big[ L'(M_i) - L'_{ii} \big]\, L'(M_i) P(M_i) = 0.$$

With $P(M_i) > 0$, the equation has the unique positive solution $L'(M_i) = L'_{ii} > 0$, which coincides with the exponential of the usual BIC for model $M_i$.

Consider now a non-minimal index $i \in I$. Proceeding by induction, assume that positive solutions $L'(M_j)$ have been computed for all $j \prec i$, where $j \prec i$ if $M_j \subsetneq M_i$. Then $L'(M_i)$ solves the quadratic equation

$$L'(M_i)^2 + b_i\, L'(M_i) - c_i = 0 \tag{3.8}$$

with

$$b_i = -L'_{ii} + \sum_{j \prec i} L'(M_j) \frac{P(M_j)}{P(M_i)}, \tag{3.9}$$

$$c_i = \sum_{j \prec i} L'_{ij}\, L'(M_j) \frac{P(M_j)}{P(M_i)}. \tag{3.10}$$

Since $c_i > 0$ by the induction hypothesis, (3.8) has the unique positive solution

$$L'(M_i) = \frac{1}{2} \Big( -b_i + \sqrt{b_i^2 + 4 c_i} \Big). \tag{3.11}$$

Based on Proposition 3.2, we make the following definition in which we consider the equation system from (3.7) under the default of a uniform prior on models, that is, $P(M_i) = 1/|I|$ for $i \in I$.

Definition 3.1. The singular Bayesian information criterion for model $M_i$ is

$$\mathrm{sBIC}(M_i) = \log L'(M_i),$$

where $(L'(M_i) : i \in I)$ is the unique solution to the equation system

$$\sum_{j \preceq i} \big[ L'(M_i) - L'_{ij} \big]\, L'(M_j) = 0, \qquad i \in I,$$

that has all entries positive.

Remark 3.1. While we envision that the use of a uniform prior on models in Definition 3.1 is reasonable for many applications, deviations from this default can be very useful; compare, for instance, Nobile (2005), who discusses priors for the number of components in mixture models. Via equation system (3.7), a non-uniform prior on models can be readily incorporated in the definition of the singular BIC.
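The inductive argument in the proof of Proposition 3.2 translates directly into a recursion. The following Python sketch is a minimal illustration of that recursion on the probability scale; the function name, the dictionary-based inputs and the toy numbers are assumptions made for this example and do not describe any particular implementation.

```python
import math


def sbic_probability_scale(L, below, prior=None):
    """Solve (3.7) via the recursion (3.8)-(3.11).

    L[i][j]  : value of L'_ij for j == i and for every j in below[i]
    below[i] : list of all strict submodels of model i (transitively closed)
    prior[i] : prior model probabilities; uniform if omitted
    Returns a dict mapping i to sBIC(M_i) = log L'(M_i).
    """
    if prior is None:
        prior = {i: 1.0 / len(L) for i in L}
    Lp = {}
    # A strict submodel has strictly fewer strict submodels, so this
    # ordering visits every submodel before the models containing it.
    for i in sorted(L, key=lambda k: len(below[k])):
        b = -L[i][i] + sum(Lp[j] * prior[j] / prior[i] for j in below[i])
        c = sum(L[i][j] * Lp[j] * prior[j] / prior[i] for j in below[i])
        Lp[i] = 0.5 * (-b + math.sqrt(b * b + 4.0 * c))
    return {i: math.log(Lp[i]) for i in Lp}


# Toy chain M_0 inside M_1 with made-up values L'_00 = 1, L'_11 = 2, L'_10 = 4.
scores = sbic_probability_scale({0: {0: 1.0}, 1: {0: 4.0, 1: 2.0}},
                                {0: [], 1: [0]})
print(scores)
```

For a minimal model the sums are empty, so $b_i = -L'_{ii}$ and $c_i = 0$, and the formula returns $L'_{ii}$ as in the proof. Working directly with likelihood values can underflow for realistic sample sizes, so this version is meant only to mirror the equations.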

According to (3.5), $\mathrm{sBIC}(M_i)$ is the logarithm of a weighted average of the approximations $L'_{ij}$, with the weights depending on the data. As in Example 2.1, it generally holds that $\lambda_i(\pi_0) \le \dim(M_i)/2$ and $m_i(\pi_0) \ge 1$. Assuming $n \ge 3$, so that $\log n \ge 1$, this implies that

$$n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1} \ge n^{-\dim(M_i)/2}.$$

Consequently, the singular BIC is of the form

$$\mathrm{sBIC}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \mathrm{penalty}(M_i),$$

where $\mathrm{penalty}(M_i) \le \frac{\dim(M_i)}{2} \log(n)$. Hence, $\mathrm{penalty}(M_i)$ may depend on the data $Y_n$ but is generally a milder penalty term than that in the usual BIC.

Remark 3.2. The computation of the singular BIC operates on the probability scale. In our implementations we compute with the logarithms of the approximations $L'_{ij}$ and subtract suitable constants before exponentiating them.

4. Consistency

As mentioned in the introduction, Schwarz's BIC from Definition 2.1 has been shown to be consistent in a number of settings, including many singular model selection problems. In this section, we show similar consistency results for the singular BIC from Definition 3.1.

As in the previous sections, we consider a finite set of models $\{M_i : i \in I\}$ and fix a data-generating distribution $\pi_0 \in \bigcup_{i \in I} M_i$. We call model $M_i$ true if $\pi_0 \in M_i$. Otherwise, $M_i$ is false. A smallest true model $M_i$ is a true model whose strict submodels are all false, that is, $j \prec i$ implies that $\pi_0 \notin M_j$.

Via the factor

$$n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1} \tag{4.1}$$

in (3.1), a learning coefficient $\lambda_i(\pi_0)$ and its multiplicity $m_i(\pi_0)$ represent a measure of complexity of model $M_i$ under data-generating distribution $\pi_0$. We say that $M_i$ has smaller Bayes complexity than $M_j$ if

$$\big( -\lambda_j(\pi_0),\, m_j(\pi_0) \big) < \big( -\lambda_i(\pi_0),\, m_i(\pi_0) \big).$$

Here, $<$ is the lexicographic order on $\mathbb{R}^2$, that is, $(x_1, y_1) < (x_2, y_2)$ if $x_1 < x_2$ or if $x_1 = x_2$ and $y_1 < y_2$. The lexicographic ordering for the pair of negated learning coefficient and multiplicity corresponds to the ordering according to the Bayes complexity factors in (4.1): a larger pair gives an asymptotically larger factor and hence a smaller Bayes complexity.
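Remark 3.2 can be made concrete for a single step of the recursion (3.8)-(3.11). The Python sketch below is one possible way to subtract a common constant before exponentiating, under a uniform prior on models; the function name and argument layout are assumptions for illustration, and the rescaling shown is not claimed to be the one used in the implementations mentioned in the remark.

```python
import math


def log_sbic_step(l_ii, l_sub, l_ij):
    """One step of the recursion (3.8)-(3.11) on the log scale, uniform prior.

    l_ii  : log L'_ii
    l_sub : list of log L'(M_j) for the strict submodels j of model i
    l_ij  : list of log L'_ij, in the same order as l_sub
    Returns log L'(M_i).
    """
    if not l_sub:
        return l_ii  # minimal model: L'(M_i) = L'_ii
    s = max([l_ii] + l_sub + l_ij)  # common constant to subtract
    # Rescaled coefficients of y^2 + b*y - c = 0 with L'(M_i) = exp(s) * y.
    b = -math.exp(l_ii - s) + sum(math.exp(v - s) for v in l_sub)
    c = sum(math.exp(u - s) * math.exp(v - s) for u, v in zip(l_ij, l_sub))
    root = math.sqrt(b * b + 4.0 * c)
    # Both expressions equal the positive root; pick the one without cancellation.
    y = 2.0 * c / (b + root) if b > 0 else 0.5 * (-b + root)
    return s + math.log(y)


# Log-values far below the range in which exp() is representable.
print(log_sbic_step(-1000.0, [-1005.0], [-998.0]))
```

Because the quadratic (3.8) is homogeneous of degree two in the quantities $L'_{ii}$, $L'_{ij}$ and $L'(M_j)$, dividing all of them by $e^{s}$ rescales the solution by the same factor, which is why adding $s$ back at the end recovers $\log L'(M_i)$.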
In order to present a general result, we make the following assumptions about the behavior of likelihood ratios and the learning coefficients and their multiplicities:

(A1) For any two true models $M_i$ and $M_k$, the sequence of likelihood ratios

$$\frac{P(Y_n \mid \hat\pi_k, M_k)}{P(Y_n \mid \hat\pi_i, M_i)}$$

is bounded in probability (i.e., uniformly tight) as $n \to \infty$.

(A2) For any pair of a true model $M_i$ and a false model $M_k$, there is a constant $\delta_{ik} > 0$ such that the probability that

$$\frac{P(Y_n \mid \hat\pi_k, M_k)}{P(Y_n \mid \hat\pi_i, M_i)} \le e^{-\delta_{ik} n}$$

tends to 1 as $n \to \infty$.

(A3) Let $M_i$ and $M_k$ be any two true models, and let $j \preceq i$ and $l \preceq k$ index any two respective true submodels. Then the Bayes complexity is monotonically increasing in the sense that $(-\lambda_{kl}, m_{kl}) < (-\lambda_{ij}, m_{ij})$ in the lexicographic order if $i \prec k$ and $j \preceq l$.

Assumption (A3) makes the natural requirement that larger models have larger Bayes complexity. This is true for the applications we will treat later; recall also Example 2.1. Note that under (A3) a true model that minimizes Bayes complexity among all true models is a smallest true model.

Assumption (A1) holds for problems in which one treats possibly singular submodels of exponential families and other well-behaved models, for which the likelihood ratios in (A1) typically converge to a limiting distribution (Drton, 2009). This is the case, for instance, in reduced-rank regression and factor analysis as treated later. As mentioned when discussing the connection between the expansions (2.5) and (2.6), the sequence of likelihood ratios for mixture models is bounded in probability when the parameter space is assumed compact; without compactness the sequence need not be bounded, as shown for Gaussian mixtures by Hartigan (1985).

Theorem 4.1 (Consistency). Let $M_{\hat\imath}$ be the model selected by maximizing the singular BIC, that is,

$$\hat\imath = \arg\max_{i \in I}\, \mathrm{sBIC}(M_i).$$

Under assumptions (A1)-(A3), the probability that $M_{\hat\imath}$ is a true model of minimal Bayes complexity (and thus also a smallest true model) tends to 1 as $n \to \infty$.

Since we are concerned with a finite set of models $\{M_i : i \in I\}$, the consistency result in Theorem 4.1 follows from separate comparisons. More precisely, it suffices to show that (i) the singular BIC of any true model is asymptotically larger than that of any false model and (ii) the singular BIC of a true model can be asymptotically maximal only if the model minimizes Bayes complexity among the true models. The comparisons (i) and (ii) are addressed in Propositions 4.1 and 4.2, respectively.
Throughout, $L'(M_i)$ refers to a coordinate of the unique positive solution to (3.7).

Proposition 4.1. Under assumption (A2), if model $M_i$ is true and model $M_k$ is false, then the probability that $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Fix an index $j \preceq i$ and a second index $h \preceq k$. Since $M_k$ is false, (A2) implies that the ratio $L'_{kh}/L'_{ij}$ converges to zero in probability as $n \to \infty$. Using Landau notation, $L'_{kh} = o_p(L'_{ij})$. Since $j$ was arbitrary, $L'_{kh} = o_p(L'_{i,\min})$, where $L'_{i,\min} = \min\{L'_{ij} : j \preceq i\}$; note that for fixed $i$ and varying $j$ the approximations $L'_{ij}$ share the likelihood term and differ only in the learning coefficients or their multiplicities. According to (3.6), $L'(M_k)$ is a weighted average of the terms $L'_{kh}$ with $h \preceq k$. We obtain that

$$L'(M_k) \le \max\{L'_{kh} : h \preceq k\} = o_p(L'_{i,\min}). \tag{4.2}$$

Similarly, $L'(M_i)$ is a weighted average of the $L'_{ij}$, $j \preceq i$, and it thus holds that

$$L'(M_i) \ge L'_{i,\min} > 0. \tag{4.3}$$

We conclude that

$$L'(M_k) = o_p\big(L'(M_i)\big). \tag{4.4}$$

It follows that

$$P\big(L'(M_i) > L'(M_k)\big) \to 1 \quad \text{as } n \to \infty, \tag{4.5}$$

which completes the proof because $\mathrm{sBIC}(M_i) = \log L'(M_i)$.

Lemma 4.1. Under assumption (A2), if $M_i$ is a smallest true model, then $\mathrm{sBIC}(M_i) = \log(L'_{ii}) + o_p(1)$.

Proof. Suppose that $a \prec b$ and that $M_a$ is false and $M_b$ true. Then we know from (4.4) that $L'(M_a) = o_p(L'(M_b))$. Using the exponentially fast decay of the ratio in (A2), the arguments in the proof of Proposition 4.1 also yield that $L'(M_a) f(n) = o_p(L'(M_b))$ for any polynomial $f(n)$. Since $L'_{ba}/L'_{b,\min}$ is a deterministic function that grows at most polynomially with $n$, and since $L'_{b,\min} \le L'(M_b)$ according to (4.3), we have

$$L'_{ba}\, L'(M_a) = o_p\big(L'(M_b)^2\big). \tag{4.6}$$

Now, if an index $i$ defines a smallest true model then, by (4.6), $c_i = o_p(L'(M_i)^2)$ and $b_i + L'_{ii} = o_p(L'(M_i))$. From the quadratic equation defining $L'(M_i)$, we deduce that

$$L'(M_i)^2 - L'_{ii}\, L'(M_i) = o_p\big(L'(M_i)^2\big). \tag{4.7}$$

Hence, the equation's positive solution satisfies

$$L'(M_i) = L'_{ii} \big(1 + o_p(1)\big). \tag{4.8}$$

Taking logarithms yields the claim.

Lemma 4.2. Suppose the data-generating distribution $\pi_0$ is in $M_k$, but that $M_k$ is not a smallest true model. Then under assumptions (A1)-(A3), the probability that there exists a smallest true model $M_i \subsetneq M_k$ with $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Let $I_0 \subseteq I$ contain the indices of true models, and let $I_0^{\min} \subseteq I_0$ contain the indices of the smallest true models. Define $L'_{k0} = \max\{L'_{kj} : j \preceq k,\ j \in I_0\}$. Since $L'(M_k)$ is a weighted average according to (3.5), we have that

$$L'(M_k) \le L'_{k0} + \max\{L'_{kj} : j \preceq k,\ j \notin I_0\}. \tag{4.9}$$

Consider now an index $j \preceq k$ with $j \in I_0$. Then there exists $i \in I_0^{\min}$ such that $i \preceq j$. Since $M_k$ is not a smallest true model, $i \prec k$. Hence, assumptions (A1) and (A3) imply that $L'_{kj} = o_p(L'_{ii})$. We deduce that $L'_{k0} = o_p(\max\{L'_{ii} : i \in I_0^{\min},\ i \prec k\})$.
It follows from Lemma 4.1, or rather (4.8), that

$$L'_{k0} = o_p\big(\max\{L'(M_i) : i \in I_0^{\min},\ i \prec k\}\big). \tag{4.10}$$

Moreover, using the observations from the proof of Proposition 4.1, we have that

$$\max\{L'_{kj} : j \preceq k,\ j \notin I_0\} = o_p\big(L'(M_i)\big) \tag{4.11}$$

for any $i \in I_0$; recall (4.2)-(4.4). Combining (4.9), (4.10) and (4.11) shows that

$$L'(M_k) = o_p\big(\max\{L'(M_i) : i \in I_0^{\min},\ i \prec k\}\big). \tag{4.12}$$

Consequently, the maximum on the right-hand side will exceed $L'(M_k)$ with probability tending to 1. Hence, we see that with asymptotic probability 1 there must exist a smallest true model $M_i \subsetneq M_k$ with $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$.

Proposition 4.2. Suppose the data-generating distribution $\pi_0$ is in $M_k$, but that $M_k$ does not minimize Bayes complexity among the true models. Then under assumptions (A1)-(A3), the probability that there exists a true model $M_i$ with minimal Bayes complexity and $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Let $i, j \in I_0^{\min}$ index two models, each being a smallest true model. Suppose further that $M_i$ has strictly smaller Bayes complexity than $M_j$. From assumption (A1), it follows that $L'_{jj} = o_p(L'_{ii})$. Appealing to Lemma 4.1, we obtain

$$L'(M_j) = o_p\big(L'(M_i)\big). \tag{4.13}$$

Taking up (4.12), we deduce that

$$L'(M_k) = o_p\big(\max\{L'(M_i) : i \in B_0^{\min}\}\big), \tag{4.14}$$

where $B_0^{\min} \subseteq I_0^{\min}$ indexes the true models of minimal Bayes complexity. Hence, with asymptotic probability 1, there exists an index $i \in B_0^{\min}$ for which it holds that $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$.

5. Bayesian behavior

Under assumption (A2), the marginal likelihood of a false model is with high probability exponentially smaller than that of any true model. The frequentist large-sample behavior of Bayesian model selection procedures is thus primarily dictated by the asymptotics of the marginal likelihood integrals of true models, which is given by (3.1). As pointed out in Section 3, the usual BIC with penalty depending solely on model dimension generally does not reflect the asymptotic behavior of the marginal likelihood of a true model that is singular.
Consequently, as the sample size increases, the Bayes factor obtained by forming the ratio of the marginal likelihood integrals for two true models may increase or decrease at a rate that is different from the rate for an approximate Bayes factor formed by exponentiating the difference of the two respective BIC scores. In this sense, there is generally nothing Bayesian about the usual BIC from Definition 2.1 when a model selection problem involves singular models. In contrast, we now show that in many interesting singular settings the new singular BIC stays close to the large-sample behavior of the log-marginal likelihood.

For this result, we impose a further condition on the learning coefficients and their multiplicities. This condition is met by our motivating examples including reduced-rank regression.

(A4) The Bayes complexity of a true model $M_i$ is nondecreasing along true submodels, that is, if $j$ and $l$ index two true submodels of $M_i$, then $j \preceq l$ implies that $(-\lambda_{il}, m_{il}) \le (-\lambda_{ij}, m_{ij})$ in the lexicographic order.

Recall that throughout we assume that the learning coefficient $\lambda_i(\pi_0)$ of a model $M_i$ and its multiplicity $m_i(\pi_0)$ satisfy $\lambda_i(\pi_0) = \lambda_{ij}$ and $m_i(\pi_0) = m_{ij}$ for almost every distribution $\pi_0$ in a submodel $M_j$ of $M_i$. Hence, a distribution $\pi$ that is chosen at random according to our assumed prior $\sum_i P(\pi \mid M_i) P(M_i)$ will satisfy the condition in the next theorem.

Theorem 5.1. Suppose model $M_i$ contains the data-generating distribution $\pi_0$, and $(\lambda_{ih}, m_{ih})$ is the minimal Bayes complexity of any true model $M_h$. If under assumptions (A1)-(A4) we have $\lambda_i(\pi_0) = \lambda_{ih}$ and $m_i(\pi_0) = m_{ih}$, then the marginal likelihood of $M_i$ satisfies

$$\log L(M_i) = \mathrm{sBIC}(M_i) + O_p(1).$$

Proof. Let $B_0^{\min}$ index the true models of minimal Bayes complexity. Then, by assumption on $\pi_0$, we have

$$L'_{ih} = L'_{ig} = L'_{\pi_0}(M_i) \tag{5.1}$$

for any two indices $g, h$ that satisfy $g, h \preceq i$ and $g, h \in B_0^{\min}$; recall (3.1) and (3.4). From (4.4) and (4.14), we get

$$\frac{L'(M_k) P(M_k)}{\sum_{j \preceq i} L'(M_j) P(M_j)} = o_p(1) \tag{5.2}$$

if $k \preceq i$ but $k \notin B_0^{\min}$. (For clarity, we include the prior model probabilities in the exposition even though these were set to a common constant in Definition 3.1.) We may deduce from (5.1) and (5.2) that

$$L'(M_i) = \sum_{k \preceq i} L'_{ik}\, \frac{L'(M_k) P(M_k)}{\sum_{j \preceq i} L'(M_j) P(M_j)} \tag{5.3}$$

$$= L'_{\pi_0}(M_i) \Bigg[ \sum_{g \preceq i,\ g \in B_0^{\min}} \frac{L'(M_g) P(M_g)}{\sum_{j \preceq i} L'(M_j) P(M_j)} + o_p(1) \Bigg] \tag{5.4}$$

$$= L'_{\pi_0}(M_i) \big(1 + o_p(1)\big). \tag{5.5}$$

The claim follows by taking logarithms and appealing to (2.6).

6. Simulations for reduced-rank regression and factor analysis

In this section we present numerical experiments comparing our sBIC from Definition 3.1 to the usual BIC with penalty based on model dimension. First, we demonstrate that sBIC can achieve superior frequentist model selection behavior. Next, we consider approximate posterior model probabilities obtained by normalizing exponentiated BIC values. We illustrate in examples that the use of sBIC allows for more posterior mass being assigned to larger models, which seems more in line with fully Bayesian procedures for model determination (recall Theorem 5.1).

6.1. Rank selection. We take up the setting of reduced-rank regression from Example 2.1 and Aoyagi and Watanabe (2005). We consider a scenario with an $N = 10$ dimensional response and $M = 15$ covariates.
We randomly generate an $N \times M$ matrix of regression coefficients $\pi$ of fixed rank 4. More precisely, we fix the signal strength by fixing the non-zero singular values of $\pi$ to be 5/4, 1, 3/4 and 1/2. The matrix $\pi$ is then obtained by pre- and postmultiplying the diagonal matrix of singular values with orthogonal matrices drawn from uniform distributions. Given $\pi$, we generate $n$ independent and identically distributed normal random vectors according to the reduced-rank regression model. From these data, rank estimates are obtained by maximizing Schwarz's BIC or the new sBIC, respectively. For each value of $n$, we run 500 simulations with varying $\pi$. The results of the simulations are shown in Figure 6.1, in which the new sBIC is seen to have clearly superior behavior in finite samples. For instance, with a sample

size of $n = 150$, sBIC identifies the true rank 4 in the majority of cases whereas the usual BIC selects a rank of 2 or 3 in virtually all cases. Consistency of sBIC appears to kick in at around $n = 200$ whereas the usual BIC needs about $n = 400$ to $n = 500$ data points before rank 4 is selected in the clear majority of cases.

Figure 6.1. Frequencies of rank estimates in reduced-rank regression using Schwarz's BIC (grey) and sBIC (black). Results from 500 simulations with parameter matrices of true rank 4. [Panels: one per sample size $n$; horizontal axis: estimated rank.]

In Figure 6.2 we plot the entropies of the relative model selection frequencies that underlie Figure 6.1 against the sample size. The smoothed curves show that the consistency of sBIC in selecting the true rank 4 goes hand in hand with the model selection frequencies concentrating on a single rank. This is not the case for the standard BIC, which at around sample size $n = 250$ selects the incorrect rank 3 in the vast majority of cases before it then begins to select rank 4. We have seen similar improvements when varying the size of the matrices or other aspects of the simulations.

6.2. Factor analysis. Lopes and West (2004, Section 6.3) fit Bayesian factor analysis models to data $Y_n$ concerning changes in the exchange rates of 6 currencies relative

to the British pound. The sample size is $n = 143$. The number of factors is restricted to be at most 3 so as to not overparametrize the $6 \times 6$ covariance matrix. Let $M_i$ be the model with $i$ factors.

Figure 6.2. Entropies of simulated model selection frequencies in reduced-rank regression, with loess curve fits. Results from 500 simulations with parameter matrices of true rank 4. [Axes: sample size (horizontal), entropy (vertical); curves for BIC and sBIC.]

In their Tables 3 and 5, Lopes and West (2004) report the following two sets of posterior model probabilities obtained from reversible jump Markov chain Monte Carlo algorithms:
\[ P(M_1 \mid Y_n) = 0.00, \quad P(M_2 \mid Y_n) = 0.88, \quad P(M_3 \mid Y_n) = 0.12 \tag{6.1} \]
and
\[ P(M_1 \mid Y_n) = 0.00, \quad P(M_2 \mid Y_n) = 0.98, \quad P(M_3 \mid Y_n) = 0.02. \tag{6.2} \]
The two cases are based on slightly different priors for the parameters of each model.

We consider these same data and compute Schwarz's BIC as well as our singular BIC. For the singular BIC it is natural to consider the model $M_0$ that postulates independence of the 6 considered changes in exchange rates. Based on ongoing work of the first author and collaborators, we use the following learning coefficients $\lambda_{ij}$ for sBIC:

j = 0 j = 1 j = j = 3 i = 0 3 i = i = 6 9 i =

with all multiplicities $m_{ij} = 1$. These learning coefficients do not include the contribution of $6/2 = 3$ from the means of the six variables.
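To make concrete how learning coefficients such as those tabulated above enter the criterion, the following sketch solves the fixed-point relation (5.3) for a chain of nested models $M_0 \subset M_1 \subset \dots$ with uniform prior model probabilities. It is an illustration under stated assumptions rather than the authors' implementation: it assumes the approximation $L'_{ij} = \exp(\hat\ell_i)\, n^{-\lambda_{ij}} (\log n)^{m_{ij} - 1}$, where $\hat\ell_i$ is the maximized log-likelihood of $M_i$ (this form is not restated in this section), and all numerical inputs are hypothetical. For a chain, the relation for $L'(M_i)$ is a quadratic equation whose positive root is taken.

```python
import numpy as np


def sbic_chain(loglik, lam, mult, n):
    """sBIC scores for nested models M_0 < M_1 < ... (uniform model prior).

    loglik[i]  : maximized log-likelihood of M_i
    lam[i][j]  : learning coefficient lambda_ij for j <= i
    mult[i][j] : multiplicity m_ij for j <= i
    """
    k = len(loglik)
    log_n = np.log(n)
    shift = max(loglik)  # common rescaling; the equations are homogeneous
    L = np.zeros(k)      # rescaled solutions L'(M_i)
    for i in range(k):
        # Rescaled L'_ij for j = 0..i (assumed form, see the lead-in).
        Lij = np.array([
            np.exp(loglik[i] - shift - lam[i][j] * log_n
                   + (mult[i][j] - 1) * np.log(log_n))
            for j in range(i + 1)
        ])
        S = L[:i].sum()
        C = float(Lij[:i] @ L[:i])
        # Positive root of x^2 + b x - C = 0, written to avoid cancellation.
        b = S - Lij[i]
        root = np.sqrt(b * b + 4.0 * C)
        L[i] = 0.5 * (root - b) if b <= 0 else 2.0 * C / (root + b)
    return np.log(L) + shift


# Hypothetical inputs, for illustration only.
loglik = [-520.0, -505.0, -500.0]
dims = [3, 6, 8]
n = 100
ones = [[1] * (i + 1) for i in range(3)]

# Regular case: lambda_ij = dim(M_i)/2 for all j, so sBIC equals BIC.
lam_regular = [[dims[i] / 2] * (i + 1) for i in range(3)]
bic = np.array(loglik) - 0.5 * np.array(dims) * np.log(n)
sbic_regular = sbic_chain(loglik, lam_regular, ones, n)

# Singular case: smaller coefficients at true submodels (made-up values).
lam_singular = [[1.5], [2.0, 3.0], [2.5, 3.5, 4.0]]
sbic_singular = sbic_chain(loglik, lam_singular, ones, n)
```

Working with a common shift of the log-likelihoods keeps the exponentials in floating-point range for moderate differences between models; a fully log-scale solver would be needed for very large differences.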

Exponentiating and renormalizing either set of BIC scores, we obtain the following approximate posterior model probabilities:

(6.3)
          $P(M_0 \mid Y_n)$   $P(M_1 \mid Y_n)$   $P(M_2 \mid Y_n)$   $P(M_3 \mid Y_n)$
   BIC
   sBIC

Comparing (6.3) to (6.1) and (6.2), we see that the approximation given by sBIC gives results that are closer to the Monte Carlo approximations than those from the standard BIC, which leads to overconfidence in model $M_2$. Of course, this assessment is necessarily subjective as it pertains to a comparison with two particular priors $P(\pi_i \mid M_i)$ in each model.

To further explore the connection between the information criteria and fully Bayesian procedures, we subsampled the considered exchange rate data to create 10 data sets for each sample size $n \in \{25, 50, 75, 100\}$. For each data set we ran the Markov chain Monte Carlo algorithms of Lopes and West (2004), focusing on the prior underlying (6.2). In Figure 6.3 we present boxplots of the four posterior model probabilities. When comparing the spread in the approximate posterior probabilities, sBIC gives a far better agreement with the fully Bayesian procedure than the standard BIC.

For the considered data, the model uncertainty mostly concerns the decision between two and three factors and can be summarized by the Bayes factor for this model comparison. In Figure 6.4, we plot the log-Bayes factors obtained from the Markov chain Monte Carlo procedure against those computed via the information criteria. The results from sBIC are seen to be rather close to Bayesian; the filled points in the scatter plot cluster around the 45 degree line. The plot also illustrates one more time that BIC is overly certain about the number of factors being two. We note that the scatter plots for $n = 25$ and $n = 100$ had two and one dataset dropped, respectively, since those had at least one estimate of a marginal likelihood equal to zero.

6.3. Gaussian mixtures.
Our final experiments pertain to univariate Gaussian mixtures. For Gaussian mixture models in which the variances of the component distributions are known and equal to a common value, it has been shown that (2.5) and (2.6) hold; as always in this theory a compactness assumption is made about the parameter space. In addition, the learning coefficients have been determined by Aoyagi (2010a). However, in statistical practice, the variances are typically unknown and often the model with unequal variances is of interest. Despite the disconnection with the existing theory, we will treat Gaussian mixtures with unknown and unequal variances, assuming that the result in (2.6) indeed applies. We are then facing a problem in which the learning coefficients have not been worked out. However, it is possible to give rather simple bounds and our point here is that these bounds are useful improvements over a count of all parameters in the model.

Let $M_i$ be the Gaussian mixture model with $i$ components, so $\pi \in M_i$ if
\[ \pi = \sum_{h=1}^{i} \alpha_h \, N(\mu_h, \sigma_h^2) \]
for choices of means $\mu_h \in \mathbb{R}$, variances $\sigma_h^2 > 0$ and mixture weights $\alpha_h \ge 0$ that sum to one. Consider now a data-generating distribution $\pi_0 \in M_j \subset M_i$. In order to represent $\pi_0$ as an element of $M_i$, we may set $i - j$ of the mixture weights to

zero, which leaves $i - j$ of the means and $i - j$ of the variance parameters free. This fact leads to the bound
\[ \lambda_{ij} \le \tfrac{1}{2} \, [(i - 1) + 2j]; \tag{6.4} \]
compare Section 7.3 in Watanabe (2009). Note that for $j < i$ this bound is strictly smaller than $\dim(M_i)/2 = (3i - 1)/2$.

Figure 6.3. Boxplots of posterior model probabilities in a factor analysis of exchange rate data under subsampling to size $n \in \{25, 50, 75, 100\}$: Results from a Markov chain Monte Carlo algorithm ("Bayes"), Schwarz's BIC and the new sBIC. [Panels: one per sample size $n$; horizontal axis: number of factors.]

We consider a familiar example, namely, the galaxies data set that was analyzed by many authors; see the review in Aitkin (2001). We use the R package mclust (Fraley et al., 2012) to fit the mixture models and then compute sBIC treating the bounds from (6.4) as values of the learning coefficients $\lambda_{ij}$. As is well-known, the likelihood function is unbounded when the variances are not bounded away

from zero. The results we present are based on the best local maxima of the likelihood function found throughout repeated runs of the EM algorithm implemented in mclust. For each model, we ran the EM algorithm 5000 times with random initializations, which were created by drawing, independently for each data point, a vector of cluster membership probabilities from the uniform distribution on the relevant probability simplex.

Figure 6.4. Scatter plot of log-Bayes factors comparing the results of a Markov chain Monte Carlo algorithm to BIC and sBIC in a factor analysis of exchange rate data under subsampling to size $n \in \{25, 50, 75, 100\}$. [Panels: one per sample size $n$; axes: Bayes (horizontal), BIC/sBIC (vertical).]

Figure 6.5 shows the values of BIC and sBIC we obtained. These are converted into posterior model probabilities in Figure 6.6, where we also show posterior probabilities from the fully Bayesian analysis of Richardson and Green (1997). The conclusions are similar to those in the previous examples. As can be expected from theory, the standard BIC leads to selection of a smaller number of components than sBIC, namely, 3 versus 6 components. The approximate posterior distribution based on BIC places essentially all mass on 3-5 components whereas that based on sBIC concentrates on 5-8 components and is more similar to the results of the fully Bayesian analysis.
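Two small computations underlie this subsection: the bound (6.4) on the learning coefficients and the conversion of criterion values into approximate posterior model probabilities. The sketch below illustrates both. The bound counts $i - j$ vanishing mixture weights plus the $3j - 1$ parameters of $M_j$, halved; the conversion exponentiates and renormalizes the scores under a uniform model prior. The function names and the example scores are hypothetical, and this is not the code used for the figures.

```python
import math


def mixture_lambda_bound(i, j):
    """Upper bound (6.4) on lambda_ij for Gaussian mixtures, 1 <= j <= i."""
    if not 1 <= j <= i:
        raise ValueError("need 1 <= j <= i")
    # (i - j) zero weights + (3j - 1) parameters of M_j = (i - 1) + 2j.
    return 0.5 * ((i - 1) + 2 * j)


def posterior_from_scores(scores):
    """Normalize exponentiated BIC or sBIC scores (uniform model prior)."""
    top = max(scores)  # subtract the maximum to avoid underflow in exp
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Bounds for a 3-component model at true submodels with j = 1, 2, 3 components.
bounds = [mixture_lambda_bound(3, j) for j in (1, 2, 3)]

# Hypothetical criterion values for models with 1, 2, 3 components.
example_scores = [-240.3, -231.8, -232.5]
example_probs = posterior_from_scores(example_scores)
```

For $j = i$ the bound equals $\dim(M_i)/2 = (3i - 1)/2$, so using it in place of the unknown learning coefficient can only lower the penalty relative to the ordinary BIC.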

Figure 6.5. Galaxies data: Values of BIC and sBIC. [Plot title: Galaxies data: Mixture of Gaussians (unequal variances); horizontal axis: number of components.]

Figure 6.6. Galaxies data: Posterior model probabilities from BIC, sBIC and MCMC as per Richardson and Green (1997). [Plot title: Galaxies data: Mixture of Gaussians (unequal variances); horizontal axis: number of components.]

7. Conclusion

In this paper we introduced a new Bayesian information criterion for singular statistical models. The new criterion, abbreviated sBIC, is free of Monte Carlo computation and coincides with the widely used criterion of Schwarz when the model is regular. Moreover, the criterion is consistent and maintains a rigorous connection to Bayesian approaches even in singular settings. This latter behavior is made possible by exploiting theoretical knowledge about the learning coefficients that capture the large-sample behavior of the concerned marginal likelihood integrals.

For problems that involve a moderate number of models and that are amenable to an exhaustive model search, the computational effort going into the calculation

of sBIC scores is comparable to that for the ordinary BIC as the effort is typically dominated by the process of fitting all considered models to the available data. Computational strategies and approximations to sBIC for problems with a large set of models constitute an interesting topic for future work.

When treating problems with a large number of models it can be beneficial to adopt a non-uniform prior distribution on models; compare e.g. the work on regression models by Chen and Chen (2008) and Scott and Berger (2010), and the work on graphical models by Foygel and Drton (2010) and Gao et al. (2012). As mentioned in Remark 3.1, it is straightforward to incorporate prior model probabilities into the definition of sBIC. Incorporating positive prior probabilities has no effect on the asymptotic results from Sections 4 and 5, as they pertain to the classical scenario of a fixed number of models and increasing sample size.

To our knowledge, sBIC is the first method to make use of information about the learning coefficients of singular models. It is this use of theoretical knowledge that allows one to avoid Monte Carlo computations. This said, the reliance on mathematical information is also what limits the applicability of sBIC. As mentioned earlier, a number of statistical models have been studied with regard to their learning coefficients. The new sBIC provides strong positive motivation for further theoretical advances. For scenarios in which the computation of learning coefficients remains intractable, the recent work of Watanabe (2013) suggests an interesting new Markov chain Monte Carlo-based alternative for approximate Bayesian model determination.

Acknowledgments

This collaboration started at a workshop at the American Institute of Mathematics, and we would like to thank the participants of the workshop for helpful discussions.
Particular thanks go to Vishesh Karwa and Dennis Leung for help with some of the numerical work. Mathias Drton was supported by the NSF (Grant No. DMS and DMS ) and by an Alfred P. Sloan Fellowship.

References

Aitkin, M. (2001) Likelihood and Bayesian analysis of mixtures. Statistical Modelling, 1,
Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans. Automat. Control, AC-19,
Aoyagi, M. (2009) Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three-layered neural network in Bayesian estimation. Int. J. Pure Appl. Math., 5,
Aoyagi, M. (2010a) A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Comm. Statist. Theory Methods, 39,
Aoyagi, M. (2010b) Stochastic complexity and generalization error of a restricted Boltzmann machine in Bayesian estimation. J. Mach. Learn. Res., 11,
Aoyagi, M. and Watanabe, S. (2005) Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18,
Arnol'd, V. I., Guseĭn-Zade, S. M. and Varchenko, A. N. (1988) Singularities of differentiable maps. Vol. II, vol. 83 of Monographs in Mathematics. Boston, MA: Birkhäuser.

Azaïs, J.-M., Gassiat, É. and Mercadier, C. (2006) Asymptotic distribution and local power of the log-likelihood ratio test for mixtures: bounded and unbounded cases. Bernoulli, 1,
Azaïs, J.-M., Gassiat, É. and Mercadier, C. (2009) The likelihood ratio test for general mixture models with or without structural parameter. ESAIM Probab. Stat., 13,
Chen, J. and Chen, Z. (2008) Extended Bayesian information criterion for model selection with large model space. Biometrika, 95,
Cheng, X. and Phillips, P. C. (2012) Cointegrating rank selection in models with time-varying variance. Journal of Econometrics, 169,
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc., 9,
Drton, M. (2009) Likelihood ratio tests and singularities. Ann. Statist., 37,
Drton, M., Sturmfels, B. and Sullivant, S. (2009) Lectures on algebraic statistics, vol. 39 of Oberwolfach Seminars. Basel: Birkhäuser Verlag.
Foygel, R. and Drton, M. (2010) Extended Bayesian information criteria for Gaussian graphical models. Adv. Neural Inf. Process. Syst., 3,
Fraley, C., Raftery, A. E., Murphy, T. B. and Scrucca, L. (2012) MCLUST version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Tech. Rep. 597, University of Washington, Department of Statistics.
Friel, N. and Wyse, J. (2012) Estimating the evidence - a review. Stat. Neerl., 66,
Gao, X., Pu, D. Q., Wu, Y. and Xu, H. (2012) Tuning parameter selection for penalized likelihood estimation of Gaussian graphical model. Statist. Sinica,,
Hartigan, J. A. (1985) A failure of likelihood asymptotics for normal mixtures. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (eds. L. M. Le Cam and R. A. Olshen), vol. II, Wadsworth.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The elements of statistical learning: Data mining, inference, and prediction. Springer Series in Statistics. New York: Springer, second edn.
Haughton, D. (1989) Size of the error in the choice of a model to fit data from an exponential family. Sankhyā Ser. A, 51,
Haughton, D. M. A. (1988) On the choice of a model to fit data from an exponential family. Ann. Statist., 16,
Kass, R. E. and Wasserman, L. (1995) A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc., 90,
Keribin, C. (2000) Consistent estimation of the order of mixture models. Sankhyā Ser. A, 6,
Lin, S. (2011) Asymptotic approximation of marginal likelihood integrals. arXiv: v.
Lopes, H. F. and West, M. (2004) Bayesian model assessment in factor analysis. Statist. Sinica, 14,
