arxiv: v1 [stat.me] 4 Sep 2013


A BAYESIAN INFORMATION CRITERION FOR SINGULAR MODELS

MATHIAS DRTON AND MARTYN PLUMMER

Abstract. We consider approximate Bayesian model choice for model selection problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. Such singular models do not obey the regularity conditions underlying the derivation of Schwarz's Bayesian information criterion (BIC), and the penalty structure in BIC generally does not reflect the frequentist large-sample behavior of their marginal likelihood. While large-sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components of mixture models, the number of factors in latent factor models or the rank in reduced-rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.

Key words and phrases. Bayesian information criterion, factor analysis, mixture model, model selection, reduced-rank regression, singular learning theory, Schwarz information criterion.

1. Introduction

Information criteria are classical tools for model selection. At a high level, they fall into two categories (Yang, 2005). On the one hand, there are criteria that target good predictive behavior of the selected model; the information criterion of Akaike (1974) and cross-validation based scores are examples. The Bayesian information criterion (BIC) of Schwarz (1978), on the other hand, draws motivation from Bayesian approaches. From the frequentist perspective, it has been shown in a number of settings that the BIC is consistent. In other words, under optimization of BIC the probability of selecting a fixed most parsimonious true model tends to one as the sample size tends to infinity (e.g., Nishii, 1984; Haughton, 1988, 1989). From a Bayesian point of view, the BIC yields rather crude but computationally inexpensive approximations to posterior model probabilities in Bayesian model selection/averaging, which are otherwise difficult to calculate; see Kass and Wasserman (1995), Raftery (1995), DiCiccio et al. (1997) or Hastie et al. (2009, Chap. 7.7).

In this paper, we are concerned with Bayesian information criteria in the context of singular model selection problems, that is, problems that involve models with Fisher-information matrices that may fail to be invertible. For example, due to the breakdown of parameter identifiability, the Fisher-information matrix of a mixture model with three component distributions is singular at a distribution that can be obtained by mixing only two components. This clearly presents a fundamental challenge for selection of the number of components. Other important examples of this type include determining the rank in reduced-rank regression, the number of factors in factor analysis or the number of states in latent class or hidden Markov models. More generally, all the classical hidden/latent variable models are singular.

As demonstrated by Steele and Raftery (2010) for Gaussian mixture models or Lopes and West (2004) for factor analysis, BIC can be a state-of-the-art method for singular model selection. However, while BIC is known to be consistent in these and other singular settings (Keribin, 2000; Drton et al., 2009, Chap. 5.1), the technical arguments in its Bayesian-inspired derivation do not apply. In a nutshell, when the Fisher information is singular, the log-likelihood function does not admit a large-sample approximation by a quadratic form. Consequently, the BIC does not reflect the frequentist large-sample behavior of the Bayesian marginal likelihood of singular models (Watanabe, 2009). In contrast, this paper develops a generalization of BIC that is not only consistent but also maintains a rigorous connection to Bayesian model choice in singular settings. The generalization is honest in the sense that the new criterion coincides with Schwarz's when the model is regular.

The new criterion, which we abbreviate to sBIC, is presented in Section 3. It relies on theoretical knowledge about the large-sample behavior of the marginal likelihood of the considered models. Section 2 reviews the necessary background on this theory as developed by Watanabe (2009). Consistency of sBIC is shown in Section 4, and the connection to Bayesian methods is developed in Section 5. In the numerical examples in Section 6, sBIC achieves improved statistical inferences while keeping computational cost low. Concluding remarks are given in Section 7.

2. Background

Let $Y_n = (Y_{n1}, \dots, Y_{nn})$ denote a sample of $n$ independent and identically distributed observations, and let $\{M_i : i \in I\}$ be a finite set of candidate models for the distribution of these observations.
For a Bayesian treatment, suppose that we have positive prior probabilities $P(M_i)$ for the models and that, in each model $M_i$, a prior distribution $P(\pi_i \mid M_i)$ is specified for the probability distributions $\pi_i \in M_i$. Write $P(Y_n \mid \pi_i, M_i)$ for the likelihood of $Y_n$ under data-generating distribution $\pi_i$ from model $M_i$. Let

$$L(M_i) := P(Y_n \mid M_i) = \int_{M_i} P(Y_n \mid \pi_i, M_i)\, dP(\pi_i \mid M_i) \tag{2.1}$$

be the marginal likelihood of model $M_i$. Bayesian model choice is then based on the posterior model probabilities

$$P(M_i \mid Y_n) \propto P(M_i)\, L(M_i), \quad i \in I.$$

The probabilities $P(M_i \mid Y_n)$ can be approximated by various Monte Carlo procedures (see Friel and Wyse (2012) for a recent review), but practitioners also often turn to computationally inexpensive proxies suggested by large-sample theory. These proxies are based on the asymptotic properties of the sequence of random variables $L(M_i)$ obtained when $Y_n$ is drawn from a data-generating distribution $\pi_0 \in M_i$ and the sample size $n$ grows.

In practice, a prior distribution $P(\pi_i \mid M_i)$ is typically specified by parametrizing $M_i$ and placing a distribution on the involved parameters. So assume that

$$M_i = \{\pi_i(\omega_i) : \omega_i \in \Omega_i\} \tag{2.2}$$

with $d_i$-dimensional parameter space $\Omega_i \subseteq \mathbb{R}^{d_i}$, and that $P(\pi_i \mid M_i)$ is the transformation of a distribution $P(\omega_i \mid M_i)$ on $\Omega_i$ under the map $\omega_i \mapsto \pi_i(\omega_i)$. The

marginal likelihood then becomes the $d_i$-dimensional integral

$$L(M_i) = \int_{\Omega_i} P(Y_n \mid \pi_i(\omega_i), M_i)\, dP(\omega_i \mid M_i). \tag{2.3}$$

The observation of Schwarz and other subsequent work is that, under suitable technical conditions on the model $M_i$, the parametrization $\omega_i \mapsto \pi_i(\omega_i)$ and the prior distribution $P(\omega_i \mid M_i)$, it holds for all $\pi_0 \in M_i$ that

$$\log L(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{d_i}{2} \log(n) + O_p(1). \tag{2.4}$$

Here, $P(Y_n \mid \hat\pi_i, M_i)$ is the maximum of the likelihood function, and $O_p(1)$ stands for a remainder that is bounded in probability, i.e., uniformly tight as the sample size $n$ grows. The first two terms on the right-hand side of (2.4) are functions of the data $Y_n$ and the model $M_i$ alone and may thus be used as a model score or proxy for the logarithm of the marginal likelihood.

Definition 2.1. The Bayesian or Schwarz's information criterion for model $M_i$ is

$$\mathrm{BIC}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{d_i}{2} \log(n).$$

Briefly put, the large-sample behavior from (2.4) relies on the following properties of regular problems. First, with high probability, the integrand in (2.3) is negligibly small outside a small neighborhood of the maximum likelihood estimator of $\omega_i$. Second, in such a neighborhood, the log-likelihood function $\log P(Y_n \mid \pi_i(\omega_i), M_i)$ can be approximated by a negative definite quadratic form, while a smooth prior $P(\omega_i \mid M_i)$ is approximately constant. The integral in (2.3) may thus be approximated by a Gaussian integral, whose normalizing constant leads to (2.4). We remark that this approach also allows for estimation of the remainder term in (2.4), giving a Laplace approximation with error $O_p(n^{-1/2})$; compare, e.g., Tierney and Kadane (1986), Haughton (1988), Kass and Wasserman (1995), Wasserman (2000). A large-sample quadratic approximation to the log-likelihood function is not possible, however, when the Fisher-information matrix is singular.
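As a minimal sketch of how the BIC score defined above is used in practice, the following Python snippet computes the score from a maximized log-likelihood and turns a list of scores into approximate posterior model probabilities by normalizing the exponentiated values. The function names and all numbers are hypothetical and serve only as an illustration; a uniform prior on models is assumed.

```python
import math

def bic(max_loglik, d, n):
    """BIC on the log-marginal-likelihood scale: max log-likelihood minus (d/2) log n."""
    return max_loglik - 0.5 * d * math.log(n)

def approx_posterior(scores):
    """Normalize exp(score) across models (uniform model prior assumed).

    The largest score is subtracted before exponentiating to avoid underflow.
    """
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up maximized log-likelihoods for two candidate models, n = 100 observations.
n = 100
scores = [bic(-120.0, d=3, n=n), bic(-118.5, d=5, n=n)]
probs = approx_posterior(scores)
```

In this made-up example the larger model improves the fit only slightly, so the dimension penalty favors the smaller model.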
Consequently, the classical theory alluded to above does not apply to singular models. Indeed, (2.4) is generally false in singular models. Nevertheless, asymptotic theory for the marginal likelihood of singular models has been developed over the last decade, culminating in the monograph of Watanabe (2009). Theorem 6.7 in Watanabe (2009) shows that a wide variety of singular models have the property that, for $Y_n$ drawn from $\pi_0 \in M_i$,

$$\log L(M_i) = \log P(Y_n \mid \pi_0, M_i) - \lambda_i(\pi_0) \log(n) + [m_i(\pi_0) - 1] \log\log(n) + O_p(1); \tag{2.5}$$

see also the introduction to the topic in Drton et al. (2009, Chap. 5.1). If the sequence of likelihood ratios $P(Y_n \mid \hat\pi_i, M_i)/P(Y_n \mid \pi_0, M_i)$ is bounded in probability, then we also have that

$$\log L(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \lambda_i(\pi_0) \log(n) + [m_i(\pi_0) - 1] \log\log(n) + O_p(1). \tag{2.6}$$
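To make the difference between the regular and the singular expansion concrete, the sketch below evaluates the leading terms of the singular expansion (maximized log-likelihood, minus learning coefficient times $\log n$, plus multiplicity minus one times $\log\log n$). The function name and the values of the learning coefficient and multiplicity are hypothetical; in the regular case the learning coefficient is $d_i/2$ with multiplicity 1, and the expression reduces to the BIC.

```python
import math

def singular_leading_terms(max_loglik, lam, mult, n):
    """Leading terms of the singular expansion; requires n > 1 so that log(log(n)) exists."""
    return max_loglik - lam * math.log(n) + (mult - 1) * math.log(math.log(n))

# Made-up inputs: a model with d = 6 parameters and n = 500 observations.
max_loglik, d, n = -250.0, 6, 500

# Regular case: learning coefficient d/2, multiplicity 1 (this is the usual BIC).
regular = singular_leading_terms(max_loglik, lam=d / 2, mult=1, n=n)

# Hypothetical singular case: smaller learning coefficient, multiplicity 2.
singular = singular_leading_terms(max_loglik, lam=2.0, mult=2, n=n)

# Positive gap: the singular penalty is milder than the dimension-based one.
gap = singular - regular
```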

For singular submodels of exponential families such as the reduced-rank regression and factor analysis models treated later, the likelihood ratios converge in distribution and are thus bounded in probability (Drton, 2009). For more complicated models, such as mixture models, likelihood ratios can often be shown to converge in distribution under compactness assumptions on the parameter space; compare, e.g., Azaïs et al. (2006, 2009). Such compactness assumptions also appear in the derivation of (2.5). We will not concern ourselves further with the details of these issues, as the main purpose of this paper is to describe a statistical method that can leverage mathematical information in the form of equation (2.6).

The quantity $\lambda_i(\pi_0)$ is known as the learning coefficient (also called the real log-canonical threshold or stochastic complexity), and $m_i(\pi_0)$ is its multiplicity. In the analytic settings considered in Watanabe (2009), $\lambda_i(\pi_0)$ is a rational number in $[0, d_i/2]$ and $m_i(\pi_0)$ is an integer in $\{1, \dots, d_i\}$. We remark that in singular models it is very difficult to estimate the $O_p(1)$ remainder term in (2.6). We are not aware of any successful work on higher-order approximations in statistically relevant settings.

Example 2.1. Reduced-rank regression is multivariate linear regression subject to a rank constraint on the matrix of regression coefficients (Reinsel and Velu, 1998). Keeping only the most essential structure, suppose we observe $n$ independent copies of a partitioned zero-mean Gaussian random vector $Y = (Y_1, Y_2)$, with $Y_1 \in \mathbb{R}^N$ and $Y_2 \in \mathbb{R}^M$, where the covariance matrix of $Y_2$ and the conditional covariance matrix of $Y_1$ given $Y_2$ are both the identity matrix. The reduced-rank regression model $M_i$ associated with an integer $i \ge 0$ postulates that the $N \times M$ matrix $\pi$ in the conditional expectation $E[Y_1 \mid Y_2] = \pi Y_2$ has rank at most $i$.
In a Bayesian treatment, consider the parametrization $\pi = \omega_2 \omega_1$, with absolutely continuous prior distributions for $\omega_2 \in \mathbb{R}^{N \times i}$ and $\omega_1 \in \mathbb{R}^{i \times M}$. Let the true data-generating distribution be given by the matrix $\pi_0$ of rank $j \le i$. Aoyagi and Watanabe (2005) derived the learning coefficients $\lambda_i(\pi_0)$ and their multiplicities $m_i(\pi_0)$ for this setup. In particular, $\lambda_i(\pi_0)$ and $m_i(\pi_0)$ depend on $\pi_0$ only through the true rank $j$. For a concrete instance, take $N = 5$ and $M = 3$. Then the multiplicity $m_i(\pi_0) = 1$ unless $i = 3$ and $j = 0$, in which case $m_i(\pi_0) = 2$. The values of $\lambda_i(\pi_0)$ are:

            j = 0    j = 1    j = 2    j = 3
    i = 0     0
    i = 1    3/2      7/2
    i = 2     3       9/2       6
    i = 3    9/2     11/2     13/2     15/2

Note that the table entries for $j = i$ are equal to $\dim(M_i)/2$, where $\dim(M_i) = i(N + M - i)$ is the dimension of $M_i$, which can be identified with the set of $N \times M$ matrices of rank at most $i$. The dimension is also the maximal rank of the Jacobian of the map $(\omega_1, \omega_2) \mapsto \omega_2 \omega_1$. The singularities of $M_i$ correspond to the points where the Jacobian fails to have maximal rank. These have $\mathrm{rank}(\omega_2 \omega_1) < i$. The fact that the singularities correspond to a drop in rank presents a challenge for model selection, which here amounts to selection of an appropriate rank. Simulation studies on rank selection have shown that the standard BIC, with $d_i = \dim(M_i)$ in Definition 2.1, has a tendency to select overly small ranks; for a

recent example see Cheng and Phillips (2012). The quoted values of $\lambda_i(\pi_0)$ give a theoretical explanation, as the use of dimension in BIC leads to overpenalization of models that contain the true data-generating distribution but are not minimal in that regard.

Determining learning coefficients can be a challenging problem, but progress has been made. For some of the examples that have been treated, we refer the reader to Aoyagi (2010a,b, 2009), Watanabe and Amari (2003), Watanabe and Watanabe (2007), Rusakov and Geiger (2005), Yamazaki and Watanabe (2003, 2005, 2004), and Zwiernik (2011). The use of techniques from computational algebra and combinatorics is emphasized in Lin (2011); see also Arnol'd et al. (1988), Vasil'ev (1979).

The mentioned theoretical progress, however, does not readily translate into practical statistical methodology because one faces the obstacle that the learning coefficients depend on the unknown data-generating distribution $\pi_0$, as indicated in our notation in (2.6). For instance, for the problem of selecting the rank in reduced-rank regression (Example 2.1), the Bayesian measure of model complexity that is given by the learning coefficient and its multiplicity depends on the rank we wish to determine in the first place. It is for this reason that there is currently no statistical method that takes advantage of theoretical knowledge about learning coefficients. In the remainder of this paper, we propose a solution to this problem of circular reasoning and give a practical extension of the Bayesian information criterion to singular models.

3. New Bayesian information criterion for singular models

If the true data-generating distribution $\pi_0$ were known, then (2.6) would suggest replacing the marginal likelihood $L(M_i)$ by

$$L'_{\pi_0}(M_i) := P(Y_n \mid \hat\pi_i, M_i)\, n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1}. \tag{3.1}$$
The data-generating distribution being unknown, however, we propose to follow the standard Bayesian approach and to assign a probability distribution $Q_i$ to the distributions in model $M_i$. We then eliminate the unknown distribution $\pi_0$ by marginalization. In other words, we compute an approximation to $L(M_i)$ as

$$L'_{Q_i}(M_i) := \int_{M_i} L'_{\pi_0}(M_i)\, dQ_i(\pi_0). \tag{3.2}$$

The crux of the matter now becomes choosing an appropriate measure $Q_i$. Before discussing particular choices for $Q_i$, we stress that any choice for $Q_i$ reduces to Schwarz's criterion in the regular case.

Proposition 3.1. If the model $M_i$ is regular, then it holds for all probability measures $Q_i$ on $M_i$ that $L'_{Q_i}(M_i) = e^{\mathrm{BIC}(M_i)}$.

Proof. In our context, a regular model with $d_i$ parameters satisfies $\lambda_i(\pi_0) = d_i/2$ and $m_i(\pi_0) = 1$ for all data-generating distributions $\pi_0 \in M_i$. Hence, the integrand in (3.2) is constant and equal to $L'_{\pi_0}(M_i) = e^{\mathrm{BIC}(M_i)}$.
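The marginalization step and the regular-case reduction can be illustrated numerically. The sketch below assumes, purely for illustration, that the averaging measure is discrete, so that the integral becomes a finite weighted sum over atoms, each carrying a learning coefficient and a multiplicity. All function names and numbers are hypothetical, and the computation is carried out on the log scale.

```python
import math

def log_L_pi0(max_loglik, lam, mult, n):
    """Log of the approximation for a known truth: loglik - lam*log(n) + (mult-1)*log(log(n))."""
    return max_loglik - lam * math.log(n) + (mult - 1) * math.log(math.log(n))

def log_L_Q(max_loglik, n, atoms):
    """Log of the averaged approximation for a discrete measure.

    atoms: list of (weight, lam, mult) with positive weights summing to one.
    Uses the log-sum-exp trick for numerical stability.
    """
    terms = [math.log(w) + log_L_pi0(max_loglik, lam, mult, n) for w, lam, mult in atoms]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

max_loglik, n, d = -80.0, 200, 4

# Regular model: every atom has lam = d/2 and mult = 1, so the weights do not matter.
regular = log_L_Q(max_loglik, n, [(0.3, d / 2, 1), (0.7, d / 2, 1)])
bic_value = max_loglik - 0.5 * d * math.log(n)

# Hypothetical singular model: atoms with different learning coefficients.
mixed = log_L_Q(max_loglik, n, [(0.6, 1.0, 1), (0.4, 2.0, 1)])
```

Here `regular` agrees with `bic_value` up to rounding, while `mixed` lies between the two single-atom values, which reflects that the averaged penalty is never harsher than the dimension-based one.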

Returning to the singular case, one possible candidate for $Q_i$ is the posterior distribution $P(\pi_0 \mid M_i, Y_n)$. Under this distribution, however, the singular models encountered in practice have the learning coefficient $\lambda_i(\pi_0)$ almost surely equal to $\dim(M_i)/2$ with multiplicity $m_i(\pi_0) = 1$; recall Example 2.1. (For a definition of model dimension, we assume that the set $M_i$ corresponds to a subset of Euclidean space.) We obtain that

$$\log L'_{Q_i}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \frac{\dim(M_i)}{2} \log(n),$$

which is the usual BIC, albeit with the possibility that $\dim(M_i) < d_i$, where $d_i$ is the dimension of the parameter space $\Omega_i$ when $M_i$ is presented as in (2.2). From a pragmatic point of view, this choice of $Q_i$ is not attractive, as it merely recovers the adjustment from $d_i$ to $\dim(M_i)$ that is standard practice when applying Schwarz's BIC to singular models. More importantly, however, averaging with respect to the posterior distribution $P(\pi_0 \mid M_i, Y_n)$ involves conditioning on the single model $M_i$, which clearly ignores the uncertainty regarding the choice of model that is inherent in model selection problems.

In most practical problems, the finite set of models $\{M_i : i \in I\}$ has interesting structure with respect to the partial order given by inclusion. For notational convenience, we define the poset structure on the index set $I$ and write $i \preceq j$ when $M_i \subseteq M_j$. Instead of conditioning on a single model, we then advocate the use of the posterior distribution

$$Q_i(\pi_0) := P(\pi_0 \mid \{M : M \subseteq M_i\}, Y_n) = \frac{\sum_{j \preceq i} P(\pi_0 \mid M_j, Y_n)\, P(M_j \mid Y_n)}{\sum_{j \preceq i} P(M_j \mid Y_n)} \tag{3.3}$$

obtained by conditioning on the family of all submodels of $M_i$. Intuitively, the proposed choice of $Q_i$ introduces the knowledge that the data-generating distribution $\pi_0$ is in $M_i$ while capturing the remaining posterior uncertainty with respect to submodels of $M_i$. This does not completely escape from the problem of circular reasoning, since (3.3) involves the posterior probabilities $P(M_j \mid Y_n)$ that we are trying to approximate. But as we argue below, this problem can be overcome. Note that the choice in (3.3) avoids asymptotics when $\pi_0$ is not in $M_i$.

In the examples motivating our work, there is an interplay between the behavior of the learning coefficients (and their multiplicities) and the submodels contained in the considered model. Suppose $\pi_0$ is a random probability measure in $M_j \subseteq M_i$, distributed according to $P(\pi_0 \mid M_j, Y_n)$. Then it typically holds that both $\lambda_i(\pi_0)$ and $m_i(\pi_0)$ are almost surely constant; recall again Example 2.1. For $j \preceq i$, let $\lambda_{ij}$ and $m_{ij}$ denote these constants and define

$$L'_{ij} := P(Y_n \mid \hat\pi_i, M_i)\, n^{-\lambda_{ij}} (\log n)^{m_{ij} - 1} > 0, \tag{3.4}$$

which can be evaluated in statistical practice. Let $L'(M_i) := L'_{Q_i}(M_i)$ when $Q_i$ is chosen as in (3.3). With $P(M_j \mid Y_n) \propto L(M_j) P(M_j)$, we obtain from (3.2) that

$$L'(M_i) = \frac{1}{\sum_{j \preceq i} L(M_j) P(M_j)} \sum_{j \preceq i} L'_{ij}\, L(M_j) P(M_j). \tag{3.5}$$

Replacing $L(M_j)$ by $L'(M_j)$ in (3.5) yields the equation system

$$L'(M_i) = \frac{1}{\sum_{j \preceq i} L'(M_j) P(M_j)} \sum_{j \preceq i} L'_{ij}\, L'(M_j) P(M_j), \quad i \in I, \tag{3.6}$$

with the unknowns being the desired marginal likelihood approximations $L'(M_i)$. Clearing the denominator, we obtain the equation system

$$\sum_{j \preceq i} \left[ L'(M_i) - L'_{ij} \right] L'(M_j) P(M_j) = 0, \quad i \in I. \tag{3.7}$$

Proposition 3.2. The equation system in (3.7) has a unique solution with all unknowns $L'(M_i) > 0$.

Proof. Suppose $i$ is a minimal element of the poset $I$. Then $j = i$ is the only choice for the index $j$, and the equation from (3.7) reads

$$\left[ L'(M_i) - L'_{ii} \right] L'(M_i) P(M_i) = 0.$$

With $P(M_i) > 0$, the equation has the unique positive solution $L'(M_i) = L'_{ii} > 0$, which coincides with the exponential of the usual BIC for model $M_i$.

Consider now a non-minimal index $i \in I$. Proceeding by induction, assume that positive solutions $L'(M_j)$ have been computed for all $j \prec i$, where $j \prec i$ if $M_j \subsetneq M_i$. Then $L'(M_i)$ solves the quadratic equation

$$L'(M_i)^2 + b_i L'(M_i) - c_i = 0 \tag{3.8}$$

with

$$b_i = -L'_{ii} + \sum_{j \prec i} L'(M_j) \frac{P(M_j)}{P(M_i)}, \tag{3.9}$$

$$c_i = \sum_{j \prec i} L'_{ij}\, L'(M_j) \frac{P(M_j)}{P(M_i)}. \tag{3.10}$$

Since $c_i > 0$ by the induction hypothesis, (3.8) has the unique positive solution

$$L'(M_i) = \frac{1}{2} \left( -b_i + \sqrt{b_i^2 + 4 c_i} \right). \tag{3.11}$$

Based on Proposition 3.2, we make the following definition, in which we consider the equation system from (3.7) under the default of a uniform prior on models, that is, $P(M_i) = 1/|I|$ for $i \in I$.

Definition 3.1. The singular Bayesian information criterion for model $M_i$ is

$$\mathrm{sBIC}(M_i) = \log L'(M_i),$$

where $(L'(M_i) : i \in I)$ is the unique solution to the equation system

$$\sum_{j \preceq i} \left[ L'(M_i) - L'_{ij} \right] L'(M_j) = 0, \quad i \in I,$$

that has all entries positive.

Remark 3.1. While we envision that the use of a uniform prior on models in Definition 3.1 is reasonable for many applications, deviations from this default can be very useful; compare, for instance, Nobile (2005), who discusses priors for the number of components in mixture models. Via equation system (3.7), a non-uniform prior on models can be readily incorporated in the definition of the singular BIC.
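The inductive argument in the proof translates directly into a recursion over the poset of models: process the models so that submodels come first, and at each step take the positive root of a quadratic. The following Python sketch does this under a uniform prior on models. The function name, the input layout, and the three-model chain with its learning coefficients and log-likelihoods are hypothetical illustrations, not values from the paper. Because the equation system is homogeneous in the unknowns and the terms $L'_{ij}$, a common constant can be subtracted on the log scale before exponentiating.

```python
import math

def solve_sbic(order, below, log_Lij):
    """Solve the sBIC equation system under a uniform prior on models.

    order   : model indices, listed so that every submodel precedes its supermodels
    below   : dict i -> list of indices j of strict submodels of model i
    log_Lij : dict (i, j) -> log L'_ij for j == i and for every j in below[i]
    Returns a dict i -> sBIC(M_i).
    """
    shift = max(log_Lij.values())  # common rescaling constant to avoid underflow
    scaled = {}                    # rescaled solutions L'(M_i) * exp(-shift)
    for i in order:
        L_ii = math.exp(log_Lij[(i, i)] - shift)
        b = -L_ii + sum(scaled[j] for j in below[i])
        c = sum(math.exp(log_Lij[(i, j)] - shift) * scaled[j] for j in below[i])
        root = math.sqrt(b * b + 4.0 * c)
        # Positive root of x^2 + b*x - c = 0; the second form avoids cancellation when b > 0.
        scaled[i] = 0.5 * (root - b) if b <= 0 else 2.0 * c / (root + b)
    return {i: math.log(scaled[i]) + shift for i in order}

# Hypothetical chain of nested models M_0 < M_1 < M_2 with made-up inputs.
n = 100
max_loglik = {0: -150.0, 1: -140.0, 2: -139.5}
lam = {(0, 0): 1.0, (1, 0): 1.5, (1, 1): 2.0, (2, 0): 2.0, (2, 1): 2.5, (2, 2): 3.0}
mult = {key: 1 for key in lam}
log_Lij = {
    (i, j): max_loglik[i] - lam[(i, j)] * math.log(n)
    + (mult[(i, j)] - 1) * math.log(math.log(n))
    for (i, j) in lam
}
order = [0, 1, 2]
below = {0: [], 1: [0], 2: [0, 1]}
sbic = solve_sbic(order, below, log_Lij)
```

For the minimal model the result equals its BIC, and for every other model the result lies between the smallest and largest of its $\log L'_{ij}$ values. This simple rescaling can still underflow when log-likelihoods of competing models differ by many hundreds; a production implementation would need more care.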

According to (3.5), $\mathrm{sBIC}(M_i)$ is the logarithm of a weighted average of the approximations $L'_{ij}$, with the weights depending on the data. As in Example 2.1, it generally holds that $\lambda_i(\pi_0) \le \dim(M_i)/2$ and $m_i(\pi_0) \ge 1$. Assuming $n \ge 3$, this implies that

$$n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1} \ge n^{-\dim(M_i)/2}.$$

Consequently, the singular BIC is of the form

$$\mathrm{sBIC}(M_i) = \log P(Y_n \mid \hat\pi_i, M_i) - \mathrm{penalty}(M_i),$$

where $\mathrm{penalty}(M_i) \le \frac{\dim(M_i)}{2} \log(n)$. Hence, $\mathrm{penalty}(M_i)$ may depend on the data $Y_n$ but is generally a milder penalty term than that in the usual BIC.

Remark 3.2. The computation of the singular BIC operates on the probability scale. In our implementations we compute with the logarithms of the approximations $L'_{ij}$ and subtract suitable constants before exponentiating them.

4. Consistency

As mentioned in the introduction, Schwarz's BIC from Definition 2.1 has been shown to be consistent in a number of settings, including many singular model selection problems. In this section, we show similar consistency results for the singular BIC from Definition 3.1.

As in the previous sections, we consider a finite set of models $\{M_i : i \in I\}$ and fix a data-generating distribution $\pi_0 \in \bigcup_{i \in I} M_i$. We call model $M_i$ true if $\pi_0 \in M_i$. Otherwise, $M_i$ is false. A smallest true model $M_i$ is a true model whose strict submodels are all false, that is, $j \prec i$ implies that $\pi_0 \notin M_j$.

Via the factor

$$n^{-\lambda_i(\pi_0)} (\log n)^{m_i(\pi_0) - 1} \tag{4.1}$$

in (3.1), a learning coefficient $\lambda_i(\pi_0)$ and its multiplicity $m_i(\pi_0)$ represent a measure of complexity of model $M_i$ under data-generating distribution $\pi_0$. We say that $M_i$ has smaller Bayes complexity than $M_j$ if

$$(-\lambda_i(\pi_0), m_i(\pi_0)) < (-\lambda_j(\pi_0), m_j(\pi_0)).$$

Here, $\le$ is the lexicographic order on $\mathbb{R}^2$, that is, $(x_1, y_1) \le (x_2, y_2)$ if $x_1 < x_2$ or if $x_1 = x_2$ and $y_1 \le y_2$. The lexicographic ordering for the pair of negated learning coefficient and multiplicity corresponds to the ordering according to the Bayes complexity factors in (4.1).
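The correspondence between the lexicographic order on pairs (negated learning coefficient, multiplicity) and the size of the complexity factor can be checked numerically. The sketch below uses made-up pairs and a large sample size; Python compares tuples lexicographically, which matches the order just described. It also checks the inequality behind the milder penalty for $n \ge 3$. All names and values are hypothetical.

```python
import math

def log_complexity_factor(lam, mult, n):
    """Log of n^(-lam) * (log n)^(mult - 1); requires n > 1."""
    return -lam * math.log(n) + (mult - 1) * math.log(math.log(n))

def lex_key(lam, mult):
    """Pair of negated learning coefficient and multiplicity (tuples compare lexicographically)."""
    return (-lam, mult)

# Made-up (learning coefficient, multiplicity) pairs.
pairs = [(4.5, 2), (4.5, 1), (5.5, 1), (3.0, 1)]

n_large = 10 ** 6
by_key = sorted(pairs, key=lambda p: lex_key(*p))
by_factor = sorted(pairs, key=lambda p: log_complexity_factor(p[0], p[1], n_large))

# Milder-penalty inequality for a model of dimension 15, checked at n = 3.
dim = 15
mild = all(
    log_complexity_factor(lam, mult, 3) >= -0.5 * dim * math.log(3)
    for lam, mult in pairs
)
```

For this sample size the two orderings coincide. For small $n$ the $\log\log n$ term can dominate differences in the learning coefficient only when the coefficients are equal, which is why the multiplicity enters as the second coordinate.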
In order to present a general result, we make the following assumptions about the behavior of likelihood ratios and the learning coefficients and their multiplicities:

(A1) For any two true models $M_i$ and $M_k$, the sequence of likelihood ratios

$$\frac{P(Y_n \mid \hat\pi_k, M_k)}{P(Y_n \mid \hat\pi_i, M_i)}$$

is bounded in probability (i.e., uniformly tight) as $n \to \infty$.

(A2) For any pair of a true model $M_i$ and a false model $M_k$, there is a constant $\delta_{ik} > 0$ such that the probability that

$$\frac{P(Y_n \mid \hat\pi_k, M_k)}{P(Y_n \mid \hat\pi_i, M_i)} \le e^{-\delta_{ik} n}$$

tends to 1 as $n \to \infty$.

(A3) Let $M_i$ and $M_k$ be any two true models, and let $j \preceq i$ and $l \preceq k$ index any two respective true submodels. Then the Bayes complexity is monotonically increasing in the sense that $(-\lambda_{ij}, m_{ij}) < (-\lambda_{kl}, m_{kl})$ if $i \prec k$ and $j \preceq l$.

Assumption (A3) makes the natural requirement that larger models have larger Bayes complexity. This is true for the applications we will treat later; recall also Example 2.1. Note that under (A3) a true model that minimizes Bayes complexity among all true models is a smallest true model.

Assumption (A1) holds for problems in which one treats possibly singular submodels of exponential families and other well-behaved models, for which the likelihood ratios in (A1) typically converge to a limiting distribution (Drton, 2009). This is the case, for instance, in reduced-rank regression and factor analysis as treated later. As mentioned when discussing the connection between the expansions (2.5) and (2.6), the sequence of likelihood ratios for mixture models is bounded in probability when the parameter space is assumed compact; without compactness the sequence need not be bounded, as shown for Gaussian mixtures by Hartigan (1985).

Theorem 4.1 (Consistency). Let $M_{\hat\imath}$ be the model selected by maximizing the singular BIC, that is,

$$\hat\imath = \arg\max_{i \in I} \mathrm{sBIC}(M_i).$$

Under assumptions (A1)-(A3), the probability that $M_{\hat\imath}$ is a true model of minimal Bayes complexity (and thus also a smallest true model) tends to 1 as $n \to \infty$.

Since we are concerned with a finite set of models $\{M_i : i \in I\}$, the consistency result in Theorem 4.1 follows from separate comparisons. More precisely, it suffices to show that (i) the singular BIC of any true model is asymptotically larger than that of any false model and (ii) the singular BIC of a true model can be asymptotically maximal only if the model minimizes Bayes complexity among the true models. The comparisons (i) and (ii) are addressed in Propositions 4.1 and 4.2, respectively.
Throughout, $L'(M_i)$ refers to a coordinate of the unique positive solution to (3.7).

Proposition 4.1. Under assumption (A2), if model $M_i$ is true and model $M_k$ is false, then the probability that $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Fix an index $j \preceq i$ and a second index $h \preceq k$. Since $M_k$ is false, (A2) implies that the ratio $L'_{kh}/L'_{ij}$ converges to zero in probability as $n \to \infty$. Using Landau notation, $L'_{kh} = o_p(L'_{ij})$. Since $j$ was arbitrary, $L'_{kh} = o_p(L'_{i,\min})$, where $L'_{i,\min} = \min\{L'_{ij} : j \preceq i\}$; note that for fixed $i$ and varying $j$ the approximations $L'_{ij}$ share the likelihood term and differ only in the learning coefficients or their multiplicities. According to (3.6), $L'(M_k)$ is a weighted average of the terms $L'_{kh}$ with $h \preceq k$. We obtain that

$$L'(M_k) \le \max\{L'_{kh} : h \preceq k\} = o_p(L'_{i,\min}). \tag{4.2}$$

Similarly, $L'(M_i)$ is a weighted average of the $L'_{ij}$, $j \preceq i$, and it thus holds that

$$L'(M_i) \ge L'_{i,\min} > 0. \tag{4.3}$$

We conclude that

$$L'(M_k) = o_p(L'(M_i)). \tag{4.4}$$

It follows that

$$P(L'(M_i) > L'(M_k)) \to 1 \quad \text{as } n \to \infty, \tag{4.5}$$

which completes the proof because $\mathrm{sBIC}(M_i) = \log L'(M_i)$.

Lemma 4.1. Under assumption (A2), if $M_i$ is a smallest true model, then $\mathrm{sBIC}(M_i) = \log(L'_{ii}) + o_p(1)$.

Proof. Suppose that $a \prec b$ and that $M_a$ is false and $M_b$ true. Then we know from (4.4) that $L'(M_a) = o_p(L'(M_b))$. Using the exponentially fast decay of the ratio in (A2), the arguments in the proof of Proposition 4.1 also yield that $L'(M_a) f(n) = o_p(L'(M_b))$ for any polynomial $f(n)$. Since $L'_{ba}/L'_{b,\min}$ is a deterministic function that grows at most polynomially with $n$, and since $L'_{b,\min} \le L'(M_b)$ according to (4.3), we have

$$L'_{ba}\, L'(M_a) = o_p(L'(M_b)^2). \tag{4.6}$$

Now, if an index $i$ defines a smallest true model then, by (4.6), $c_i = o_p(L'(M_i)^2)$ and $b_i + L'_{ii} = o_p(L'(M_i))$. From the quadratic equation defining $L'(M_i)$, we deduce that

$$L'(M_i)^2 - L'_{ii} L'(M_i) = o_p(L'(M_i)^2). \tag{4.7}$$

Hence, the equation's positive solution satisfies

$$L'(M_i) = L'_{ii}(1 + o_p(1)). \tag{4.8}$$

Taking logarithms yields the claim.

Lemma 4.2. Suppose the data-generating distribution $\pi_0$ is in $M_k$, but that $M_k$ is not a smallest true model. Then under assumptions (A1)-(A3), the probability that there exists a smallest true model $M_i \subset M_k$ with $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Let $I_0 \subseteq I$ contain the indices of true models, and let $I_0^{\min} \subseteq I_0$ contain the indices of the smallest true models. Define $L'_{k0} = \max\{L'_{kj} : j \preceq k,\ j \in I_0\}$. Since $L'(M_k)$ is a weighted average according to (3.5), we have that

$$L'(M_k) \le L'_{k0} + \max\{L'_{kj} : j \preceq k,\ j \notin I_0\}. \tag{4.9}$$

Consider now an index $j \preceq k$ with $j \in I_0$. Then there exists $i \in I_0^{\min}$ such that $i \preceq j$. Since $M_k$ is not a smallest true model, $i \prec k$. Hence, assumptions (A1) and (A3) imply that $L'_{kj} = o_p(L'_{ii})$. We deduce that $L'_{k0} = o_p(\max\{L'_{ii} : i \in I_0^{\min},\ i \prec k\})$.
It follows from Lemma 4.1, or rather (4.8), that

$$L'_{k0} = o_p(\max\{L'(M_i) : i \in I_0^{\min},\ i \prec k\}). \tag{4.10}$$

Moreover, using the observations from the proof of Proposition 4.1, we have that

$$\max\{L'_{kj} : j \preceq k,\ j \notin I_0\} = o_p(L'(M_i)) \tag{4.11}$$

for any $i \in I_0$; recall (4.2)-(4.4). Combining (4.9), (4.10) and (4.11) shows that

$$L'(M_k) = o_p(\max\{L'(M_i) : i \in I_0^{\min},\ i \prec k\}). \tag{4.12}$$

Consequently, the maximum on the right-hand side will exceed $L'(M_k)$ with probability tending to 1. Hence, we see that with asymptotic probability 1 there must exist a smallest true model $M_i \subset M_k$ with $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$.

Proposition 4.2. Suppose the data-generating distribution $\pi_0$ is in $M_k$, but that $M_k$ does not minimize Bayes complexity among the true models. Then under assumptions (A1)-(A3), the probability that there exists a true model $M_i$ with minimal Bayes complexity and $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$ tends to 1 as $n \to \infty$.

Proof. Let $i, j \in I_0^{\min}$ index two models, each being a smallest true model. Suppose further that $M_i$ has strictly smaller Bayes complexity than $M_j$. From assumption (A1), it follows that $L'_{jj} = o_p(L'_{ii})$. Appealing to Lemma 4.1, we obtain

$$L'(M_j) = o_p(L'(M_i)). \tag{4.13}$$

Taking up (4.12), we deduce that

$$L'(M_k) = o_p(\max\{L'(M_i) : i \in B_0^{\min}\}), \tag{4.14}$$

where $B_0^{\min} \subseteq I_0^{\min}$ indexes the true models of minimal Bayes complexity. Hence, with asymptotic probability 1, there exists an index $i \in B_0^{\min}$ for which it holds that $\mathrm{sBIC}(M_i) > \mathrm{sBIC}(M_k)$.

5. Bayesian behavior

Under assumption (A2), the marginal likelihood of a false model is with high probability exponentially smaller than that of any true model. The frequentist large-sample behavior of Bayesian model selection procedures is thus primarily dictated by the asymptotics of the marginal likelihood integrals of true models, which is given by (3.1). As pointed out in Section 3, the usual BIC with penalty depending solely on model dimension generally does not reflect the asymptotic behavior of the marginal likelihood of a true model that is singular.
Consequently, as the sample size increases, the Bayes factor obtained by forming the ratio of the marginal likelihood integrals for two true models may increase or decrease at a rate that is different from the rate for an approximate Bayes factor formed by exponentiating the difference of the two respective BIC scores. In this sense, there is generally nothing Bayesian about the usual BIC from Definition 2.1 when a model selection problem involves singular models. In contrast, we now show that in many interesting singular settings the new singular BIC stays close to the large-sample behavior of the log-marginal likelihood. For this result, we impose a further condition on the learning coefficients and their multiplicities. This condition is met by our motivating examples, including reduced-rank regression.

(A4) The Bayes complexity of a true model $M_i$ is nondecreasing along true submodels, that is, if $j$ and $l$ index two true submodels of $M_i$, then $j \preceq l$ implies that $(-\lambda_{ij}, m_{ij}) \le (-\lambda_{il}, m_{il})$.

Recall that throughout we assume that the learning coefficient $\lambda_i(\pi_0)$ of a model $M_i$ and its multiplicity $m_i(\pi_0)$ satisfy $\lambda_i(\pi_0) = \lambda_{ij}$ and $m_i(\pi_0) = m_{ij}$ for almost every distribution $\pi_0$ in a submodel $M_j$ of $M_i$. Hence, a distribution $\pi$ that is chosen at random according to our assumed prior $\sum_i P(\pi \mid M_i) P(M_i)$ will satisfy the condition in the next theorem.

Theorem 5.1. Suppose model $M_i$ contains the data-generating distribution $\pi_0$, and $(\lambda_{ih}, m_{ih})$ is the minimal Bayes complexity of any true model $M_h$. If under assumptions (A1)-(A4) we have $\lambda_i(\pi_0) = \lambda_{ih}$ and $m_i(\pi_0) = m_{ih}$, then the marginal likelihood of $M_i$ satisfies

$$\log L(M_i) = \mathrm{sBIC}(M_i) + O_p(1).$$

Proof. Let $B_0^{\min}$ index the true models of minimal Bayes complexity. Then, by assumption on $\pi_0$, we have

$$L'_{ih} = L'_{ig} = L'_{\pi_0}(M_i) \tag{5.1}$$

for any two indices $g, h$ that satisfy $g, h \preceq i$ and $g, h \in B_0^{\min}$; recall (3.1) and (3.4). From (4.4) and (4.14), we get

$$\frac{L'(M_k) P(M_k)}{\sum_{j \preceq i} L'(M_j) P(M_j)} = o_p(1) \tag{5.2}$$

if $k \preceq i$ but $k \notin B_0^{\min}$. (For clarity, we include the prior model probabilities in the exposition even though these were set to a common constant in Definition 3.1.) We may deduce from (5.1) and (5.2) that

$$L'(M_i) = \sum_{k \preceq i} L'_{ik} \frac{L'(M_k) P(M_k)}{\sum_{j \preceq i} L'(M_j) P(M_j)} \tag{5.3}$$

$$= L'_{\pi_0}(M_i) \left[ \sum_{g \preceq i,\ g \in B_0^{\min}} \frac{L'(M_g) P(M_g)}{\sum_{j \preceq i} L'(M_j) P(M_j)} + o_p(1) \right] \tag{5.4}$$

$$= L'_{\pi_0}(M_i)\,(1 + o_p(1)). \tag{5.5}$$

The claim follows by taking logarithms and appealing to (2.6).

6. Simulations for reduced-rank regression and factor analysis

In this section we present numerical experiments comparing our sBIC from Definition 3.1 to the usual BIC with penalty based on model dimension. First, we demonstrate that sBIC can achieve superior frequentist model selection behavior. Next, we consider approximate posterior model probabilities obtained by normalizing exponentiated BIC values. We illustrate in examples that the use of sBIC allows for more posterior mass being assigned to larger models, which seems more in line with fully Bayesian procedures for model determination (recall Theorem 5.1).

6.1. Rank selection. We take up the setting of reduced-rank regression from Example 2.1 and Aoyagi and Watanabe (2005). We consider a scenario with an $N = 10$ dimensional response and $M = 15$ covariates.
We randomly generate an $N \times M$ matrix of regression coefficients $\pi$ of fixed rank 4. More precisely, we fix the signal strength by fixing the non-zero singular values of $\pi$ to be 5/4, 1, 3/4 and 1/2. The matrix $\pi$ is then obtained by pre- and postmultiplying the diagonal matrix of singular values with orthogonal matrices drawn from uniform distributions. Given $\pi$, we generate $n$ independent and identically distributed normal random vectors according to the reduced-rank regression model. From these data, rank estimates are obtained by maximizing Schwarz's BIC or the new sBIC, respectively. For each value of $n$, we run 500 simulations with varying $\pi$. The results of the simulations are shown in Figure 6.1, in which the new sBIC is seen to have clearly superior behavior in finite samples. For instance, with a sample

size of $n = 150$, sBIC identifies the true rank 4 in the majority of cases whereas the usual BIC selects a rank of 2 or 3 in virtually all cases. Consistency of sBIC appears to kick in at around $n = 200$ whereas the usual BIC needs about $n = 400$ to $n = 500$ data points before rank 4 is selected in the clear majority of cases.

Figure 6.1. Frequencies of rank estimates in reduced-rank regression using Schwarz's BIC (grey) and sBIC (black). Results from 500 simulations with parameter matrices of true rank 4. Panels correspond to different sample sizes $n$; the horizontal axis of each panel shows the estimated rank.

In Figure 6.2 we plot the entropies of the relative model selection frequencies that underlie Figure 6.1 against the sample size. The smoothed curves show that the consistency of sBIC in selecting the true rank 4 goes hand in hand with the model selection frequencies concentrating on a single rank. This is not the case for the standard BIC, which at around sample size $n = 250$ selects the incorrect rank 3 in the vast majority of cases before it then begins to select rank 4. We have seen similar improvements when varying the size of the matrices or other aspects of the simulations.

6.2. Factor analysis. Lopes and West (2004, Section 6.3) fit Bayesian factor analysis models to data $Y_n$ concerning changes in the exchange rates of 6 currencies relative

to the British pound. The sample size is $n = 143$. The number of factors is restricted to be at most 3 so as to not overparametrize the $6 \times 6$ covariance matrix. Let $M_i$ be the model with $i$ factors. In their Tables 3 and 5, Lopes and West (2004) report the following two sets of posterior model probabilities obtained from reversible jump Markov chain Monte Carlo algorithms:
$$P(M_1 \mid Y_n) = 0.00, \quad P(M_2 \mid Y_n) = 0.88, \quad P(M_3 \mid Y_n) = 0.12 \tag{6.1}$$
and
$$P(M_1 \mid Y_n) = 0.00, \quad P(M_2 \mid Y_n) = 0.98, \quad P(M_3 \mid Y_n) = 0.02. \tag{6.2}$$
The two cases are based on slightly different priors for the parameters of each model.

Figure 6.2. Entropies of simulated model selection frequencies in reduced-rank regression, with loess curve fits. Results from 500 simulations with parameter matrices of true rank 4. Horizontal axis: sample size; vertical axis: entropy; curves for BIC and sBIC.

We consider these same data and compute Schwarz's BIC as well as our singular BIC. For the singular BIC it is natural to consider the model $M_0$ that postulates independence of the 6 considered changes in exchange rates. Based on ongoing work of the first author and collaborators, we use the following learning coefficients $\lambda_{ij}$ for sBIC:

j = 0 j = 1 j = 2 j = 3 i = 0 3 i = i = 6 9 i =

with all multiplicities $m_{ij} = 1$. These learning coefficients do not include the contribution of $6/2 = 3$ from the means of the six variables.
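The conversion from criterion values to approximate posterior model probabilities that is used throughout this section amounts to exponentiating the scores and renormalizing. A minimal Python sketch is given below, assuming equal prior model probabilities and scores on the log-likelihood scale for which larger values are better. The function names and the four scores are hypothetical and are not values from the exchange rate analysis.

```python
import math


def posterior_from_scores(scores):
    """Normalize exponentiated criterion values into model probabilities.

    scores: criterion values on the log-likelihood scale (larger is better),
    one per model; equal prior model probabilities are assumed.
    """
    top = max(scores)                      # subtract the maximum for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def log_bayes_factor(score_a, score_b):
    """Approximate log-Bayes factor of model a versus model b."""
    return score_a - score_b


# Invented scores for four models M_0, ..., M_3 (not values from the paper).
hypothetical_scores = [-1010.0, -1002.0, -995.0, -997.0]
probs = posterior_from_scores(hypothetical_scores)
print([round(p, 3) for p in probs])
print(log_bayes_factor(hypothetical_scores[2], hypothetical_scores[3]))
```

Subtracting the largest score before exponentiating leaves the normalized probabilities unchanged and avoids underflow when the scores are large negative numbers, as log-likelihood based criteria usually are.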

Exponentiating and renormalizing either set of BIC scores, we obtain the following approximate posterior model probabilities:

(6.3)   $P(M_0 \mid Y_n)$   $P(M_1 \mid Y_n)$   $P(M_2 \mid Y_n)$   $P(M_3 \mid Y_n)$
BIC
sBIC

Comparing (6.3) to (6.1) and (6.2), we see that the approximation given by sBIC is closer to the Monte Carlo approximations than that from the standard BIC, which leads to overconfidence in model $M_2$. Of course, this assessment is necessarily subjective as it pertains to a comparison with two particular priors $P(\pi_i \mid M_i)$ in each model.

To further explore the connection between the information criteria and fully Bayesian procedures, we subsampled the considered exchange rate data to create 10 data sets for each sample size $n \in \{25, 50, 75, 100\}$. For each data set we ran the Markov chain Monte Carlo algorithms of Lopes and West (2004), focusing on the prior underlying (6.2). In Figure 6.3 we present boxplots of the four posterior model probabilities. When comparing the spread in the approximate posterior probabilities, sBIC gives far better agreement with the fully Bayesian procedure than the standard BIC.

For the considered data, the model uncertainty mostly concerns the decision between two and three factors and can be summarized by the Bayes factor for this model comparison. In Figure 6.4, we plot the log-Bayes factors obtained from the Markov chain Monte Carlo procedure against those computed via the information criteria. The results from sBIC are seen to be rather close to Bayesian; the filled points in the scatter plot cluster around the 45 degree line. The plot also illustrates one more time that BIC is overly certain about the number of factors being two. We note that the scatter plots for $n = 25$ and $n = 100$ had two and one dataset dropped, respectively, since those had at least one estimate of a marginal likelihood equal to zero.

6.3. Gaussian mixtures.
Our final experiments pertain to univariate Gaussian mixtures. For Gaussian mixture models in which the variances of the component distributions are known and equal to a common value, it has been shown that (2.5) and (2.6) hold; as always in this theory, a compactness assumption is made about the parameter space. In addition, the learning coefficients have been determined by Aoyagi (2010a). However, in statistical practice, the variances are typically unknown and often the model with unequal variances is of interest. Despite the disconnect from the existing theory, we will treat Gaussian mixtures with unknown and unequal variances, assuming that the result in (2.6) indeed applies. We are then facing a problem in which the learning coefficients have not been worked out. However, it is possible to give rather simple bounds, and our point here is that these bounds are useful improvements over a count of all parameters in the model.

Let $M_i$ be the Gaussian mixture model with $i$ components, so $\pi \in M_i$ if
$$\pi = \sum_{h=1}^{i} \alpha_h \, N(\mu_h, \sigma_h^2)$$
for choices of means $\mu_h \in \mathbb{R}$, variances $\sigma_h^2 > 0$ and mixture weights $\alpha_h \ge 0$ that sum to one. Consider now a data-generating distribution $\pi_0 \in M_j \subseteq M_i$. In order to represent $\pi_0$ as an element of $M_i$, we may set $i - j$ of the mixture weights to

zero, which leaves $i - j$ of the means and $i - j$ of the variance parameters free. This fact leads to the bound
$$\lambda_{ij} \le \tfrac{1}{2} \, [(i - 1) + 2j]; \tag{6.4}$$
compare Section 7.3 in Watanabe (2009). Note that for $j < i$ this bound is strictly smaller than $\dim(M_i)/2 = (3i - 1)/2$.

Figure 6.3. Boxplots of posterior model probabilities in a factor analysis of exchange rate data under subsampling to size $n \in \{25, 50, 75, 100\}$: Results from a Markov chain Monte Carlo algorithm ("Bayes"), Schwarz's BIC and the new sBIC. One panel per sample size; horizontal axis: number of factors.

We consider a familiar example, namely, the galaxies data set that was analyzed by many authors; see the review in Aitkin (2001). We use the R package mclust (Fraley et al., 2012) to fit the mixture models and then compute sBIC treating the bounds from (6.4) as values of the learning coefficients $\lambda_{ij}$. As is well-known, the likelihood function is unbounded when the variances are not bounded away

from zero. The results we present are based on the best local maxima of the likelihood function found throughout repeated runs of the EM algorithm implemented in mclust. For each model, we ran the EM algorithm 5000 times with random initializations, which were created by drawing, independently for each data point, a vector of cluster membership probabilities from the uniform distribution on the relevant probability simplex.

Figure 6.4. Scatter plot of log-Bayes factors comparing the results of a Markov chain Monte Carlo algorithm to BIC and sBIC in a factor analysis of exchange rate data under subsampling to size $n \in \{25, 50, 75, 100\}$. One panel per sample size; horizontal axis: Bayes; vertical axis: BIC/sBIC.

Figure 6.5 shows the values of BIC and sBIC we obtained. These are converted into posterior model probabilities in Figure 6.6, where we also show posterior probabilities from the fully Bayesian analysis of Richardson and Green (1997). The conclusions are similar to those in the previous examples. As can be expected from theory, the standard BIC leads to selection of a smaller number of components than sBIC, namely, 3 versus 6 components. The approximate posterior distribution based on BIC places essentially all mass on 3-5 components whereas that based on sBIC concentrates on 5-8 components and is more similar to the results of the fully Bayesian analysis.
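To make the gain from the bound (6.4) concrete, the short Python sketch below tabulates it next to half the model dimension. It assumes that (6.4) reads $\lambda_{ij} \le ((i - 1) + 2j)/2$, which is the reading that agrees with the stated value $\dim(M_i)/2 = (3i - 1)/2$ at $j = i$. The function names are hypothetical, and the code only evaluates the bound; it does not compute learning coefficients.

```python
from fractions import Fraction


def mixture_lambda_bound(i, j):
    """Assumed form of the bound (6.4) for an i-component Gaussian mixture
    when the data-generating distribution has j <= i components."""
    if not 1 <= j <= i:
        raise ValueError("need 1 <= j <= i")
    return Fraction((i - 1) + 2 * j, 2)


def half_dimension(i):
    """dim(M_i)/2 for i means, i variances and i - 1 free mixture weights."""
    return Fraction(3 * i - 1, 2)


# Tabulate the bound for small models; the gap to dim/2 grows as j decreases.
for i in range(1, 5):
    row = [str(mixture_lambda_bound(i, j)) for j in range(1, i + 1)]
    print(f"i={i}: bounds={row}, dim/2={half_dimension(i)}")
```

Each step down in $j$ lowers the bound by one, since one free mean and one free variance together contribute two parameters that no longer count toward the penalty.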

Figure 6.5. Galaxies data: Values of BIC and sBIC. Mixture of Gaussians with unequal variances; horizontal axis: number of components.

Figure 6.6. Galaxies data: Posterior model probabilities from BIC, sBIC and MCMC as per Richardson and Green (1997). Mixture of Gaussians with unequal variances; horizontal axis: number of components.

7. Conclusion

In this paper we introduced a new Bayesian information criterion for singular statistical models. The new criterion, abbreviated sBIC, is free of Monte Carlo computation and coincides with the widely-used criterion of Schwarz when the model is regular. Moreover, the criterion is consistent and maintains a rigorous connection to Bayesian approaches even in singular settings. This latter behavior is made possible by exploiting theoretical knowledge about the learning coefficients that capture the large-sample behavior of the concerned marginal likelihood integrals. For problems that involve a moderate number of models and that are amenable to an exhaustive model search, the computational effort going into the calculation

of sBIC scores is comparable to that for the ordinary BIC as the effort is typically dominated by the process of fitting all considered models to the available data. Computational strategies and approximations to sBIC for problems with a large set of models constitute an interesting topic for future work.

When treating problems with a large number of models it can be beneficial to adopt a non-uniform prior distribution on models; compare e.g. the work on regression models by Chen and Chen (2008) and Scott and Berger (2010), and the work on graphical models by Foygel and Drton (2010) and Gao et al. (2012). As mentioned in Remark 3.1, it is straightforward to incorporate prior model probabilities into the definition of sBIC. Incorporating positive prior probabilities has no effect on the asymptotic results from Sections 4 and 5, as they pertain to the classical scenario of a fixed number of models and increasing sample size.

To our knowledge, sBIC is the first method to make use of information about the learning coefficients of singular models. It is this use of theoretical knowledge that allows one to avoid Monte Carlo computations. This said, the reliance on mathematical information is also what limits the applicability of sBIC. As mentioned earlier, a number of statistical models have been studied with regard to their learning coefficients. The new sBIC provides strong positive motivation for further theoretical advances. For scenarios in which the computation of learning coefficients remains intractable, the recent work of Watanabe (2013) suggests an interesting new Markov chain Monte Carlo-based alternative for approximate Bayesian model determination.

Acknowledgments

This collaboration started at a workshop at the American Institute of Mathematics, and we would like to thank the participants of the workshop for helpful discussions.
Particular thanks go to Vishesh Karwa and Dennis Leung for help with some of the numerical work. Mathias Drton was supported by the NSF (Grant No. DMS and DMS ) and by an Alfred P. Sloan Fellowship.

References

Aitkin, M. (2001) Likelihood and Bayesian analysis of mixtures. Statistical Modelling, 1,
Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans. Automat. Control, AC-19,
Aoyagi, M. (2009) Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three-layered neural network in Bayesian estimation. Int. J. Pure Appl. Math., 5,
Aoyagi, M. (2010a) A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Comm. Statist. Theory Methods, 39,
Aoyagi, M. (2010b) Stochastic complexity and generalization error of a restricted Boltzmann machine in Bayesian estimation. J. Mach. Learn. Res., 11,
Aoyagi, M. and Watanabe, S. (2005) Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18,
Arnol'd, V. I., Guseĭn-Zade, S. M. and Varchenko, A. N. (1988) Singularities of differentiable maps. Vol. II, vol. 83 of Monographs in Mathematics. Boston, MA: Birkhäuser.

Azaïs, J.-M., Gassiat, É. and Mercadier, C. (2006) Asymptotic distribution and local power of the log-likelihood ratio test for mixtures: bounded and unbounded cases. Bernoulli, 1,
Azaïs, J.-M., Gassiat, É. and Mercadier, C. (2009) The likelihood ratio test for general mixture models with or without structural parameter. ESAIM Probab. Stat., 13,
Chen, J. and Chen, Z. (2008) Extended Bayesian information criterion for model selection with large model space. Biometrika, 95,
Cheng, X. and Phillips, P. C. (2012) Cointegrating rank selection in models with time-varying variance. Journal of Econometrics, 169,
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc., 9,
Drton, M. (2009) Likelihood ratio tests and singularities. Ann. Statist., 37,
Drton, M., Sturmfels, B. and Sullivant, S. (2009) Lectures on algebraic statistics, vol. 39 of Oberwolfach Seminars. Basel: Birkhäuser Verlag.
Foygel, R. and Drton, M. (2010) Extended Bayesian information criteria for Gaussian graphical models. Adv. Neural Inf. Process. Syst., 3,
Fraley, C., Raftery, A. E., Murphy, T. B. and Scrucca, L. (2012) MCLUST version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Tech. Rep. 597, University of Washington, Department of Statistics.
Friel, N. and Wyse, J. (2012) Estimating the evidence - a review. Stat. Neerl., 66,
Gao, X., Pu, D. Q., Wu, Y. and Xu, H. (2012) Tuning parameter selection for penalized likelihood estimation of Gaussian graphical model. Statist. Sinica,,
Hartigan, J. A. (1985) A failure of likelihood asymptotics for normal mixtures. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (eds. L. M. L. Cam and R. A. Olshen), vol. II, Wadsworth.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The elements of statistical learning. Springer Series in Statistics.
New York: Springer, second edn. Data mining, inference, and prediction.
Haughton, D. (1989) Size of the error in the choice of a model to fit data from an exponential family. Sankhyā Ser. A, 51,
Haughton, D. M. A. (1988) On the choice of a model to fit data from an exponential family. Ann. Statist., 16,
Kass, R. E. and Wasserman, L. (1995) A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc., 90,
Keribin, C. (2000) Consistent estimation of the order of mixture models. Sankhyā Ser. A, 6,
Lin, S. (2011) Asymptotic approximation of marginal likelihood integrals. arXiv: v.
Lopes, H. F. and West, M. (2004) Bayesian model assessment in factor analysis. Statist. Sinica, 14,


High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Open Problems in Algebraic Statistics

Open Problems in Algebraic Statistics Open Problems inalgebraic Statistics p. Open Problems in Algebraic Statistics BERND STURMFELS UNIVERSITY OF CALIFORNIA, BERKELEY and TECHNISCHE UNIVERSITÄT BERLIN Advertisement Oberwolfach Seminar Algebraic

More information

Model Comparison. Course on Bayesian Inference, WTCN, UCL, February Model Comparison. Bayes rule for models. Linear Models. AIC and BIC.

Model Comparison. Course on Bayesian Inference, WTCN, UCL, February Model Comparison. Bayes rule for models. Linear Models. AIC and BIC. Course on Bayesian Inference, WTCN, UCL, February 2013 A prior distribution over model space p(m) (or hypothesis space ) can be updated to a posterior distribution after observing data y. This is implemented

More information

Weighted tests of homogeneity for testing the number of components in a mixture

Weighted tests of homogeneity for testing the number of components in a mixture Computational Statistics & Data Analysis 41 (2003) 367 378 www.elsevier.com/locate/csda Weighted tests of homogeneity for testing the number of components in a mixture Edward Susko Department of Mathematics

More information

Choosing a model in a Classification purpose. Guillaume Bouchard, Gilles Celeux

Choosing a model in a Classification purpose. Guillaume Bouchard, Gilles Celeux Choosing a model in a Classification purpose Guillaume Bouchard, Gilles Celeux Abstract: We advocate the usefulness of taking into account the modelling purpose when selecting a model. Two situations are

More information

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University this presentation derived from that presented at the Pan-American Advanced

More information

Recurrent Latent Variable Networks for Session-Based Recommendation

Recurrent Latent Variable Networks for Session-Based Recommendation Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

Asymptotic Analysis of the Bayesian Likelihood Ratio for Testing Homogeneity in Normal Mixture Models

Asymptotic Analysis of the Bayesian Likelihood Ratio for Testing Homogeneity in Normal Mixture Models Asymptotic Analysis of the Bayesian Likelihood Ratio for Testing Homogeneity in Normal Mixture Models arxiv:181.351v1 [math.st] 9 Dec 18 Natsuki Kariya, and Sumio Watanabe Department of Mathematical and

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Lecture 6: Model Checking and Selection

Lecture 6: Model Checking and Selection Lecture 6: Model Checking and Selection Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 27, 2014 Model selection We often have multiple modeling choices that are equally sensible: M 1,, M T. Which

More information

Stochastic Complexity of Variational Bayesian Hidden Markov Models

Stochastic Complexity of Variational Bayesian Hidden Markov Models Stochastic Complexity of Variational Bayesian Hidden Markov Models Tikara Hosino Department of Computational Intelligence and System Science, Tokyo Institute of Technology Mailbox R-5, 459 Nagatsuta, Midori-ku,

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

arxiv: v1 [stat.me] 6 Nov 2013

arxiv: v1 [stat.me] 6 Nov 2013 Electronic Journal of Statistics Vol. 0 (0000) ISSN: 1935-7524 DOI: 10.1214/154957804100000000 A Generalized Savage-Dickey Ratio Ewan Cameron e-mail: dr.ewan.cameron@gmail.com url: astrostatistics.wordpress.com

More information

arxiv: v2 [stat.me] 23 Dec 2015

arxiv: v2 [stat.me] 23 Dec 2015 Marginal likelihood and model selection for Gaussian latent tree and forest models arxiv:141.885v [stat.me] 3 Dec 015 Mathias Drton 1 Shaowei Lin Luca Weihs 1 and Piotr Zwiernik 3 1 Department of Statistics,

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Algebraic Geometry and Model Selection

Algebraic Geometry and Model Selection Algebraic Geometry and Model Selection American Institute of Mathematics 2011/Dec/12-16 I would like to thank Prof. Russell Steele, Prof. Bernd Sturmfels, and all participants. Thank you very much. Sumio

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Mixtures and Hidden Markov Models for analyzing genomic data

Mixtures and Hidden Markov Models for analyzing genomic data Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

A Brief Review of Probability, Bayesian Statistics, and Information Theory

A Brief Review of Probability, Bayesian Statistics, and Information Theory A Brief Review of Probability, Bayesian Statistics, and Information Theory Brendan Frey Electrical and Computer Engineering University of Toronto frey@psi.toronto.edu http://www.psi.toronto.edu A system

More information

Stochastic Realization of Binary Exchangeable Processes

Stochastic Realization of Binary Exchangeable Processes Stochastic Realization of Binary Exchangeable Processes Lorenzo Finesso and Cecilia Prosdocimi Abstract A discrete time stochastic process is called exchangeable if its n-dimensional distributions are,

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Bayesian Inference in Astronomy & Astrophysics A Short Course

Bayesian Inference in Astronomy & Astrophysics A Short Course Bayesian Inference in Astronomy & Astrophysics A Short Course Tom Loredo Dept. of Astronomy, Cornell University p.1/37 Five Lectures Overview of Bayesian Inference From Gaussians to Periodograms Learning

More information

Testing Algebraic Hypotheses

Testing Algebraic Hypotheses Testing Algebraic Hypotheses Mathias Drton Department of Statistics University of Chicago 1 / 18 Example: Factor analysis Multivariate normal model based on conditional independence given hidden variable:

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis Yasunori Fujikoshi and Tetsuro Sakurai Department of Mathematics, Graduate School of Science,

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

VARIABLE SELECTION AND INDEPENDENT COMPONENT

VARIABLE SELECTION AND INDEPENDENT COMPONENT VARIABLE SELECTION AND INDEPENDENT COMPONENT ANALYSIS, PLUS TWO ADVERTS Richard Samworth University of Cambridge Joint work with Rajen Shah and Ming Yuan My core research interests A broad range of methodological

More information

How New Information Criteria WAIC and WBIC Worked for MLP Model Selection

How New Information Criteria WAIC and WBIC Worked for MLP Model Selection How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection Seiya Satoh and Ryohei akano ational Institute of Advanced Industrial Science and Tech, --7 Aomi, Koto-ku, Tokyo, 5-6, Japan Chubu

More information

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Comment on Article by Scutari

Comment on Article by Scutari Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs

More information

FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE

FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE Donald A. Pierce Oregon State Univ (Emeritus), RERF Hiroshima (Retired), Oregon Health Sciences Univ (Adjunct) Ruggero Bellio Univ of Udine For Perugia

More information

Sampling Contingency Tables

Sampling Contingency Tables Sampling Contingency Tables Martin Dyer Ravi Kannan John Mount February 3, 995 Introduction Given positive integers and, let be the set of arrays with nonnegative integer entries and row sums respectively

More information