Bayesian variable selection in high dimensional problems without assumptions on prior model probabilities


arXiv: v1 [stat.ME] 11 Jul 2016

Bayesian variable selection in high dimensional problems without assumptions on prior model probabilities

J. O. Berger (Duke University), G. García-Donato (Universidad de Castilla-La Mancha), M. A. Martínez-Beneito (FISABIO, Valencia) and V. Peña (Duke University)

July 12, 2016

Abstract

We consider the problem of variable selection in linear models when p, the number of potential regressors, may exceed (perhaps substantially) the sample size n (which is possibly small).

1 Introduction and notation

In model selection problems the uncertainty about which model has generated the data is explicitly taken into account. Variable selection is a particular model selection problem in which all models share a common functional form but differ in the explanatory variables they contain. We consider the problem of variable selection in linear models when p, the number of potential regressors, may exceed (perhaps substantially) the sample size n (which is possibly small). See West (2002) and Johnstone and Titterington (2009) for excellent introductions to the topic.

Let y be a sample of n observations of the response variable and let X be the n × p design matrix whose columns contain the potential explanatory variables. As is common in the literature, we compactly index the set of all candidate models M_γ ∈ M by a binary vector γ^t = (γ_1, ..., γ_p).

Each γ_i is zero or one, indicating whether or not the i-th covariate is included in M_γ. Hence M = {M_γ : γ ∈ {0,1}^p}, where

M_γ : y = α 1_n + X_γ β_γ + ε,   ε ~ N(0, σ² I_n),

X_γ is the n × k_γ sub-matrix of X whose columns correspond to the ones in γ, and β_γ is the associated regression parameter of dimension k_γ = Σ_{i=1}^p γ_i. We assume that if n > k_γ then rank(1_n, X_γ) = k_γ + 1, and that if n ≤ k_γ then rank(X_γ) = n. Finally, denote by V_γ the matrix X_γ with its columns centered on their means, that is, V_γ = (I − P_n) X_γ, where P_n = 1_n 1_n^t / n is the orthogonal projection onto the space spanned by the intercept.

We denote by M_0 the null model (γ = 0), which has k_0 = 1 regressor (just the intercept). The problem in which the null model contains no regressors (k_0 = 0) is very similar and will also be considered throughout the paper. In that case it is assumed that if n ≥ k_γ then rank(X_γ) = k_γ and if n < k_γ then rank(X_γ) = n, and further V_γ = X_γ. In order not to duplicate all the formulas, and at the price of a slight abuse of notation, in what follows it should be understood that the parameter α does not exist when k_0 = 0.
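As a concrete illustration of this notation, the following R sketch builds X_γ and the centered matrix V_γ from a binary vector γ; the data, the chosen γ and all object names are illustrative and do not come from the authors' code.

```r
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)                     # n x p matrix of potential regressors
gamma <- as.integer(seq_len(p) %in% c(2, 5, 9))     # binary vector selecting covariates 2, 5 and 9
k.gamma <- sum(gamma)                               # k_gamma, the dimension of M_gamma

X.gamma <- X[, gamma == 1, drop = FALSE]            # n x k_gamma sub-matrix of X
P.n <- matrix(1 / n, n, n)                          # projection onto the intercept, 1_n 1_n^t / n
V.gamma <- (diag(n) - P.n) %*% X.gamma              # column-centered covariates, V_gamma
qr(V.gamma)$rank                                    # equals k_gamma here, since n > k_gamma + 1
```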

The formal Bayesian answer to the model selection problem is based on the posterior distribution over the model space, f(γ | y) ∝ B_γ f(γ), where f(γ) is the prior probability of M_γ and B_γ is the Bayes factor of M_γ to a fixed model, here taken to be M_0. B_γ is the ratio of integrated likelihoods m_γ(y)/m_0(y), where

m_γ(y) = ∫ M_γ(y | β_γ, α, σ) π_γ(β_γ, α, σ) dβ_γ dα dσ,

and π_γ is the prior distribution, a quite delicate aspect of the Bayesian approach. Most of the popular model selection priors in this context, like g-priors, Zellner-Siow priors, hyper-g priors, etc., share a similar functional form. This family of priors, which we call conventional priors, has been studied in depth by Bayarri et al. (2012), who showed that they have many appealing theoretical properties. In this paper we propose an extension of the conventional priors that covers the situation with more potential regressors than data points and that has the original conventional priors as a particular case. We call these priors regularized conventional priors. Our extension has important connections with other proposals in the literature like... We introduce the main motivating ideas in Section 2. The rest of the paper is organized as follows.

2 Conventional priors and motivating ideas

In this work we adopt the term conventional (used by Berger and Pericchi, 2001) to refer to a large family of priors that are extremely popular in the literature. In the standard scenario with more data points than potential regressors (n ≥ p + k_0), these priors are of the form π_γ(β_γ, α, σ) = σ^{-1} π_γ(β_γ | α, σ), with the conditional distribution being an elliptical density of the type

π_γ(β_γ | α, σ) = ∫_0^∞ N_{k_γ}(β_γ | 0, t S_γ) p_n(t) dt,    (1)

where S_γ = σ² [V_γ^t V_γ]^{-1} is the sampling variance matrix of the maximum likelihood estimator of β_γ and p_n(t) is a proper density that acts as a mixing density. The presence of P_n in S_γ (through the centered matrix V_γ) has traditionally been justified as making it sensible to use the same improper prior π(α, σ) = σ^{-1} for the common parameters, because in this parameterization the common parameters are orthogonal to the model-specific parameters β_γ (in the Fisher information sense). Nevertheless, Bayarri et al. (2012) have shown that using π(α, σ) = σ^{-1} can in fact be formally justified with invariance and predictive matching arguments, and that orthogonality does not play any role. They have also shown that the use of P_n in the definition of the prior scale is related to null predictive matching, providing a characterization of conventional priors. Interestingly, this criterion has important implications in this study, as we will see.

The conventional family contains, through particular choices of p_n, very popular priors such as the Zellner-Siow priors (Zellner and Siow, 1984), the g-priors (Zellner, 1986; Fernández et al., 2001), the hyper-g priors (Liang et al., 2008) and the robust priors (Bayarri et al., 2012), just to mention some. Recently, Bayarri et al. (2012) have shown that conventional priors have optimal properties in the sense that they satisfy several formal criteria. In particular, they showed that, irrespective of p_n, conventional priors are measurement and group invariant, and exact, dimensional and null predictive matching.
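A draw from the conventional prior (1) can be obtained by first drawing the mixing parameter t and then the conditional normal. The sketch below does this in R for the objects of the previous snippet, taking σ = 1 and, as an assumed choice of p_n, the inverse-gamma(1/2, n/2) density that yields the Zellner-Siow prior; any other proper mixing density could be substituted.

```r
library(MASS)                                       # for mvrnorm()
sigma <- 1
S.gamma <- sigma^2 * solve(crossprod(V.gamma))      # conventional scale matrix S_gamma
draw.conventional <- function(ndraws) {
  t(sapply(seq_len(ndraws), function(i) {
    t.mix <- 1 / rgamma(1, shape = 1 / 2, rate = n / 2)  # t ~ inverse-gamma(1/2, n/2)
    mvrnorm(1, mu = rep(0, k.gamma), Sigma = t.mix * S.gamma)
  }))
}
beta.draws <- draw.conventional(1000)               # 1000 draws from pi_gamma(beta_gamma | alpha, sigma)
```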

It is also very convenient that conventional priors lead to a simple expression for the Bayes factors:

B_γ = ∫_0^∞ (1 + t Q_γ)^{-(n-k_0)/2} (1 + t)^{(n-k_γ-k_0)/2} p_n(t) dt,    (2)

where Q_γ = SSE_γ / SSE_0 is the ratio of the sums of squared errors of M_γ and M_0. One appealing characteristic of the Robust prior proposed in Bayarri et al. (2012) is that the above integral can be expressed in closed form using a hypergeometric function. An alternative formula for the Bayes factor is

B_γ = ∫_0^∞ (1 + t n Q_γ)^{-(n-k_0)/2} (1 + t n)^{(n-k_γ-k_0)/2} p*_n(t) dt,    (3)

where p*_n(t) = p_n(t n) n.

As a direct consequence of the next result, the matrix S_γ is defined only when n ≥ k_γ + k_0.

Result 1. The rank of the n × k_γ matrix V_γ is n − k_0 if n < k_γ + k_0, and k_γ if n ≥ k_γ + k_0.

Proof. The case k_0 = 0 is a straightforward consequence of the assumptions about the rank of X_γ. [Show the case k_0 = 1.]
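The integral in (2) is straightforward to evaluate numerically for a regular model. The following sketch does so with R's integrate(), again with the Zellner-Siow mixing density as an assumed p_n and a simulated response; the Robust prior used later in the paper would instead give the closed-form hypergeometric expression.

```r
set.seed(2)
y <- as.numeric(1 + X.gamma %*% rnorm(k.gamma) + rnorm(n))   # illustrative response
k0 <- 1                                                       # null model = intercept only
SSE.gamma <- sum(residuals(lm(y ~ X.gamma))^2)
SSE.0     <- sum(residuals(lm(y ~ 1))^2)
Q.gamma   <- SSE.gamma / SSE.0                                # ratio of sums of squared errors

p.n <- function(t)                                            # inverse-gamma(1/2, n/2) density
  ifelse(t > 0, dgamma(1 / t, shape = 1 / 2, rate = n / 2) / t^2, 0)
integrand <- function(t)
  (1 + t * Q.gamma)^(-(n - k0) / 2) * (1 + t)^((n - k.gamma - k0) / 2) * p.n(t)
B.gamma <- integrate(integrand, lower = 0, upper = Inf)$value # Bayes factor (2) of M_gamma to M_0
B.gamma
```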

The implication is that, when n < p + k_0, conventional priors are not defined for all competing models, since models M_γ with k_γ + k_0 > n would have an undefined prior scale matrix. Models in the model space can then be classified as regular (when k_γ + k_0 < n), saturated (when k_γ + k_0 = n) and singular (when k_γ + k_0 > n).

In part because of the problem described above, the development of Bayesian methods when M contains singular models has been inspired by sources of motivation other than the conventional tradition. Prominent among these approaches are those based on regularization methods like the Lasso (least absolute shrinkage and selection operator) introduced by Tibshirani (1996) and its Bayesian counterparts (see e.g. Park and Casella, 2008), which use a Laplace prior. The most appealing feature of the Lasso is sparsity, meaning that when it is applied to the full model (γ = 1) the estimates of certain regression parameters are exactly zero. Hence the Lasso undoubtedly induces a type of variable selection, but it is not a formal model selection procedure, since the most complex model is implicitly assumed to be the true model (no model uncertainty is considered). In practical terms, the immediate consequence is that there is no way of measuring the uncertainty associated with the model selection exercise. That limitation has been noted by Hans (2010) and Lykou and Ntzoufras (2013), who have embedded the Lasso approach into the formal framework of model selection by adopting a multivariate Laplace prior for the specific regression parameters of each entertained model. Despite their undoubted value and interest, the only justification of these priors is their connection with the Lasso methodology and, to the best of our knowledge, no optimal property has been proved for them. The most distinctive feature of these priors with respect to the conventional priors is not the form of the prior itself (e.g. the Laplace density can be written as a scale mixture of normal distributions) but the independence assumed among the regression parameters. This yields a proper density but, as acknowledged by Lykou and Ntzoufras (2013), the Bayesian Lasso does not account for the structure of the covariates.

A compromise that does take the structure of the covariates into account within a formal model selection framework is given by the ridge-inspired procedures of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012). These authors incorporate dependence among the regression coefficients using an extension of the g-prior that circumvents the singularity of the conventional scale matrix through the introduction of a ridge parameter λ. In particular, they propose using the scale matrix σ² [V_γ^t V_γ + λ I]^{-1} (normally after a transformation of the covariates so that they have unit scale). A drawback of this setting is the specification of the ridge parameter, which in principle may have a strong impact on the results. Another is that the priors used for the regular models are not conventional priors and hence do not share their optimal properties. A more sophisticated and also interesting extension of conventional priors is that of Maruyama and George (2012), which handles the case n < p through a singular value decomposition of X_γ. (For our records: this is not really an extension as it does not have the g-prior as a particular case, hence the optimal properties are not inherited. For instance, is their prior invariant under changes in the units?)

In Section 3 we define a generalization of conventional priors that we call regularized conventional priors. These are proper priors and are based on using, as the conditional scale of β_γ, a non-singular generalized inverse of V_γ^t V_γ. Such matrices equal the inverse of V_γ^t V_γ for regular models, justifying that our proposal generalizes the conventional priors.

The form of the scale matrix has connections with the ridge-based approaches of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012). For singular models, regularized conventional priors are not univocally defined but, within a model, the posterior distribution of any estimable function is unique. This paper is about model selection, and we will show the surprising result that, for singular models, the associated Bayes factor is one. As we will see, this can be viewed as an extension of the null predictive matching criterion and is also congruent with full rank factorizations of the problem. The impossibility of distinguishing between singular models M_γ is also implicitly present in other methodologies like the Lasso, where the result can never be one of such models (Rosset and Zhu, 2007) [copy reference at the end!].

[Work more on this paragraph: it should state the important conclusion that what then remains is nothing more than a multiple testing problem.] When p is much larger than n there is still a huge number of regular models, and what arises is a multiple testing problem, so a control for multiplicity is called for. The discussion and arguments in Scott and Berger (2006) point out that the proper Bayesian way of handling multiplicity issues is through the prior distribution over the model space. We adopt here their recommendation of using the prior

P(M_γ) = [ (p choose k_γ) (p + 1) ]^{-1},    (4)

which penalizes models belonging to dimensions containing a large number of models, where it is precisely more likely that false signals appear.
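The effect of the prior (4) is easy to see numerically: every dimension receives the same total mass 1/(p + 1), so a single model in a crowded dimension gets much less prior probability than a single model in a sparse one. A small R illustration (with p = 8 purely for readability):

```r
p <- 8
k <- 0:p
models.per.dim  <- choose(p, k)                      # number of models of each dimension
prior.per.model <- 1 / (choose(p, k) * (p + 1))      # P(M_gamma) in (4) for a model of dimension k
mass.per.dim    <- models.per.dim * prior.per.model  # total mass of each dimension: always 1/(p + 1)
round(cbind(k, models.per.dim, prior.per.model, mass.per.dim), 5)
```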

The rest of the paper is organized as follows.

3 Regularized conventional priors

The non-singularity of the scale matrix in the ridge-based proposals of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012) is due to the addition of the diagonally dominant matrix λI, which intuitively results in a substantial modification of the original scale matrix. This modification seems unneeded for regular models, where S_γ is non-singular, and here we study alternative ways of defining the scale matrix. In the following definition R(·) denotes the subspace spanned by the rows of the matrix in its argument.

Definition 1. For M_γ, with parameters (α, β_γ, σ²), the regularized conventional prior for β_γ | α, σ² is any proper prior of the form

π_γ(β_γ | α, σ) = ∫_0^∞ N_{k_γ}(β_γ | 0, t S*_γ) p_n(t) dt,    (5)

where p_n(t) is a mixing density, S*_γ = σ² [V_γ^t V_γ + T_γ]^{-1}, and T_γ is any symmetric positive semi-definite matrix such that R(V_γ^t V_γ) ⊕ R(T_γ) = R^{k_γ}.

Note that these priors extend the conventional priors since, for regular models, R(V_γ^t V_γ) = R^{k_γ} and by definition T_γ is the zero matrix. Further, for singular models these priors are proper and hence valid for model selection. Finally, note that such priors exist, since we can always take T_γ = C_γ^t C_γ, where C_γ : (k_γ − n + k_0) × k_γ is of full rank and has rows that are linearly independent of the rows of V_γ. In this way, for singular models, the regularized conventional priors can be seen as conventional priors with respect to the extended design matrix obtained by stacking V_γ on top of C_γ, in which the missing observations of the covariates (the k_γ − n + k_0 additional cases needed to complete the rank) are replaced by the imputed values in C_γ.

Of course, the regularized conventional priors depend on T_γ (or on the imputed values C_γ); nevertheless, the dependence happens in such a way that the corresponding prior does not influence the likelihood. The reason why this is the case is contained in the following important result.

Result 2. Let T_γ be a matrix defined as in (5). Then the regular matrix [V_γ^t V_γ + T_γ]^{-1} is a generalized inverse of V_γ^t V_γ.

Proof. An immediate consequence of a theorem in Harville (1997, page 421) and the fact that [V_γ^t V_γ + T_γ]^{-1} is a generalized inverse (because it is an inverse) of [V_γ^t V_γ + T_γ].

The above result provides a very appealing interpretation of regularized conventional priors: the matrix (V_γ^t V_γ)^{-1}, which is not defined for all models, is replaced by a regular generalized inverse (V_γ^t V_γ)^-. Said another way, we can keep the desired scale (the one based on V_γ^t V_γ) while still using a regular matrix, hence defining a valid model selection prior.
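The construction in Definition 1, and the generalized-inverse property of Result 2, can be checked numerically. The sketch below builds a hypothetical singular model (n = 10, k_γ = 15, k_0 = 1), completes V_γ^t V_γ with T_γ = C_γ^t C_γ for a randomly generated C_γ (whose rows are, with probability one, linearly independent of the rows of V_γ), and verifies that the resulting regular matrix is a generalized inverse of V_γ^t V_γ.

```r
set.seed(3)
n <- 10; k.gamma <- 15
X.gamma <- matrix(rnorm(n * k.gamma), n, k.gamma)
V <- scale(X.gamma, center = TRUE, scale = FALSE)          # V_gamma; rank n - 1 by Result 1
C.gamma <- matrix(rnorm((k.gamma - n + 1) * k.gamma),      # (k_gamma - n + k0) x k_gamma rows, k0 = 1
                  k.gamma - n + 1, k.gamma)
T.gamma <- crossprod(C.gamma)                              # T_gamma = C_gamma^t C_gamma
A <- crossprod(V)                                          # V_gamma^t V_gamma, singular
G <- solve(A + T.gamma)                                    # regular matrix used in the prior scale
max(abs(A %*% G %*% A - A))                                # ~ 0: G is a generalized inverse of A
```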

The regularized conventional priors are not unique, as there are many different ways of defining the matrix T_γ. Nevertheless, a crucial property of these priors is invariance with respect to that choice when estimating, within a given model, a parameter that is univocally informed by the data (what are usually called estimable functions; see Harville 19??). We take this as evidence that regularized conventional priors are sensible priors and hence satisfy the principle in Berger and Pericchi (2001) about model selection priors.

Result 3. Let M_γ be any singular model and let θ be an estimable function. Then the posterior distribution of θ, given that M_γ is the true model, does not depend on the choice of the matrix T_γ.

Proof. (Removing the subindex γ for simplicity.) First show that

β | α, σ, t, y ~ N_k(m, Σ),

where

m = [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} V^t y   and   Σ = σ² [V^t V (1 + t^{-1}) + T t^{-1}]^{-1}.

Secondly, show that [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} is a generalized inverse of V^t V (1 + t^{-1}). It is well known that the matrix product V [V^t V (1 + t^{-1})]^- V^t does not depend on the choice of generalized inverse, implying that in our case

H = V [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} V^t

in particular does not depend on T. Finally, notice that estimable functions are of the type θ = ℓ^t V β, where ℓ is any vector of conformable dimension, with posterior distribution

θ | α, σ, t, y ~ N(ℓ^t H y, σ² ℓ^t H ℓ),

which is independent of T.

(Perhaps show that the lasso and ridge-based approaches do depend on these choices?)
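Result 3 can also be checked numerically: for the hypothetical singular model of the previous sketch, two different completion matrices C give exactly the same matrix H, and therefore the same posterior for any estimable function.

```r
t.mix <- 2.5                                               # any fixed value of the mixing parameter t
H.of <- function(Cmat) {                                   # H = V [V^t V (1 + 1/t) + T/t]^{-1} V^t
  Tmat <- crossprod(Cmat)
  V %*% solve(crossprod(V) * (1 + 1 / t.mix) + Tmat / t.mix) %*% t(V)
}
C1 <- matrix(rnorm((k.gamma - n + 1) * k.gamma), k.gamma - n + 1, k.gamma)
C2 <- matrix(rnorm((k.gamma - n + 1) * k.gamma), k.gamma - n + 1, k.gamma)
max(abs(H.of(C1) - H.of(C2)))                              # ~ 0: H does not depend on the choice of T
```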

Our aim, however, is model selection. In the next result we show that the resulting Bayes factor of any singular model to the null is one.

Result 4. Let M_γ be a singular model. Then B_γ = 1, independently of the choice of T_γ and of p_n(·).

Proof. (Removing the subindex γ for simplicity.) According to Result 1, if M is singular then rank(V) = n − k_0, and this matrix admits a full rank factorization of the type V = L R, where L : n × (n − k_0) and R : (n − k_0) × k are of full rank. Now, in computing the integral m_γ(y), apply the change of variables β_R = R β and β_C = C β (here C : (k − n + k_0) × k is of full rank and such that T = C^t C) to show that

m_γ(y) = m_0(y) det(L^t (I − P_n) L)^{-1/2} det(V^t V + T)^{1/2} det((R ; C))^{-1},

where (R ; C) denotes the matrix obtained by stacking R on top of C (this is mainly due to the null predictive matching property of conventional priors). Now show that the factor multiplying m_0(y) above is 1.

4 Unitary Bayes factors

The main consequence of adopting the regularized conventional priors in a variable selection problem containing singular models is that B_γ = 1 for singular models, while for regular models B_γ is a standard conventional Bayes factor. There are a number of independent arguments that also support unitary Bayes factors for singular models.

Null predictive matching. This is one of the criteria proposed by Bayarri et al. (2012) to construct sensible objective prior distributions, and it reflects the idea, going back to Jeffreys (1961), that data of minimal size for a given model should not allow one to distinguish between that model and the null. In the regular case this implies that, when n = k_γ + k_0 (saturated models), the Bayes factor of M_γ to the null must be one. (Indeed, for a saturated model SSE_γ = 0, so Q_γ = 0 and n − k_γ − k_0 = 0, and (2) reduces to ∫ p_n(t) dt = 1.) Interestingly, as highlighted by Bayarri et al. (2012), null predictive matching provides a characterization of the scale S_γ, since no other matrix (not even a multiple of S_γ) can achieve this predictive matching. This was taken by Bayarri et al. (2012) as positive evidence in favour of conventional priors (the ridge-based proposals do not have this property). The situation with n < k_γ + k_0 is an extreme case of data of minimal size (having more parameters than observations was explicitly mentioned by Jeffreys, 1961; check!) in the sense that, as in the saturated case, there are only enough data to estimate the estimable functions in M_γ, but not to distinguish it from the null.

Reparameterization. Consider first the following result:

Result 5. Any singular model M_γ admits a reparameterization as a saturated model M̃_γ (i.e. with k̃_γ = n − k_0).

Proof. [Revise to cover the case k_0 = 0.] Without loss of generality, any model M_γ can be reparameterized as

M_γ : y = α 1_n + V_γ β_γ + ε,   ε ~ N(0, σ² I_n),

where V_γ = (I − P_n) X_γ. According to Result 1, if M_γ is singular then rank(V_γ) = n − 1, and this matrix admits a full rank factorization of the type V_γ = L_γ R_γ, where L_γ : n × (n − 1) is of full column rank and R_γ : (n − 1) × k_γ is of full row rank. Hence we can parameterize M_γ as

M̃_γ : y = α 1_n + L_γ β̃_γ + ε,   ε ~ N(0, σ² I_n),    (6)

where β̃_γ = R_γ β_γ is now (n − 1)-dimensional. [Show that M̃_γ can be constructed in such a way that it is a saturated (original) model.]

According to the result above, singular models M_γ have an equivalent representation as saturated models (which, recall, have a conventional Bayes factor to the null equal to one). This coincidence does not depend on the arbitrary choice of the matrices L and R used to construct the reparameterization. Notice that a prior distribution without the null predictive matching property (outside the conventional family) could easily lead to a Bayes factor that depends on these matrices.

In what follows we use the conventional Robust prior of Bayarri et al. (2012) because it is our preferred prior, but everything applies to the whole family of conventional priors.
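A full rank factorization like the one used in Result 5 can be computed, for instance, from the singular value decomposition. Continuing with the hypothetical singular model of the earlier sketch (n = 10, k_γ = 15, k_0 = 1):

```r
sv <- svd(V)
r  <- sum(sv$d > 1e-8)                                     # numerical rank of V_gamma, here n - 1
L  <- sv$u[, 1:r] %*% diag(sv$d[1:r])                      # n x (n - 1), full column rank
R  <- t(sv$v[, 1:r])                                       # (n - 1) x k_gamma, full row rank
max(abs(L %*% R - V))                                      # ~ 0: V_gamma = L R
dim(L)                                                     # the reparameterized model (6) has n - 1 regressors
```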

5 The posterior distribution when p >> n

Up to here, the conclusion is that there is no informative content (coming from the data) in singular models (of course, this does not imply that they do not influence the posterior distribution). Since p is expected to be much bigger than n, this implies that there is information in only a very small number of models (in the example in Section 6, with p = 8408 and n = 41, the proportion of regular models over the total number of models is of a negligible order of magnitude). The obvious and crucial question that arises is whether we can learn something or not.

In what follows we denote by M_S the set of singular or saturated models (these share a unitary Bayes factor, which makes it convenient to group them in a common set) and by M_R the rest (formed by the regular models). See Table 1.

  Dimension          Type        Subset of M
  k_γ + k_0 < n      Regular     M_R
  k_γ + k_0 = n      Saturated   M_S
  k_γ + k_0 > n      Singular    M_S

  Table 1: Types of models.

Application of Bayes' theorem shows that the posterior distribution can be expressed as a weighted average over the subsets of M defined above, with weights given by their posterior probabilities, which we denote (here M_T represents the true model) P_S = Pr(M_T ∈ M_S | y) and 1 − P_S. Notice that

P_S = Σ_{M_γ ∈ M_S} B_γ Pr(M_γ) / [ Σ_{M_γ ∈ M_S} B_γ Pr(M_γ) + Σ_{M_γ ∈ M_R} B_γ Pr(M_γ) ],

and then, because B_γ = 1 for M_γ ∈ M_S,

P_S = Pr(M_S) / [ Pr(M_S) + Pr(M_R) C(n, p) ],

where C(n, p) is the normalizing constant conditional on M_T ∈ M_R, that is,

C(n, p) = Σ_{M_γ ∈ M_R} B_γ Pr(M_γ | M_T ∈ M_R).

For the particular case of the prior in (4), notice that, of the p + 1 different dimensions, n − k_0 correspond to M_R and the rest, p − n + k_0 + 1, belong to M_S, so that Pr(M_S) = (p − n + k_0 + 1)/(p + 1) and hence

P_S = (p − n + k_0 + 1) / [ (p − n + k_0 + 1) + (n − k_0) C(n, p) ].    (7)

Now, any summary of the posterior distribution can be constructed as a weighted average. One popular such summary is the inclusion probability. The posterior inclusion probability of x_i is

q_i = Pr(x_i ∈ M_T | y) = q_i^R (1 − P_S) + q_i^S P_S,    (8)

where q_i^R denotes the inclusion probability conditional on M_T ∈ M_R, with identical notation for q_i^S. For the case of the prior in (4) it can be seen that

q_i^S = (1/2) [ p(p + 1) − (n − k_0)(n − k_0 − 1) ] / [ p (p − n + k_0 + 1) ],

which tends to 1/2 when p grows and n is either constant or grows at a rate n ∼ p^a, for some 0 < a < 1, or n ∼ log(p) (here ∼ means asymptotic equivalence). When n grows linearly with p, say n ∼ f p for some 0 < f < 1, then q_i^S tends to (1 − f²)/2. Hence, if p is large enough, the informative content in q_i depends on the magnitude of P_S.
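In R, formulas (7) and (8) amount to a couple of lines once estimates of C(n, p) and of the regular-part inclusion probabilities q_i^R are available; in the scheme described in the next section these come from the Gibbs sampling output, while here they are simply hypothetical placeholder values used with the dimensions of the example of Section 6 (p = 8408, n = 41, k_0 = 1).

```r
p <- 8408; n <- 41; k0 <- 1
C.np <- 5e4                 # hypothetical estimate of C(n, p); in practice obtained from the Gibbs output
q.R  <- 0.90                # hypothetical estimate of q_i^R for some covariate x_i

P.S <- (p - n + k0 + 1) / ((p - n + k0 + 1) + (n - k0) * C.np)                  # formula (7)
q.S <- (p * (p + 1) - (n - k0) * (n - k0 - 1)) / (2 * p * (p - n + k0 + 1))     # q_i^S under prior (4)
q.i <- q.R * (1 - P.S) + q.S * P.S                                              # formula (8)
c(P.S = P.S, q.S = q.S, q.i = q.i)
```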

6 The methodology in practice and an illustrative application

When M is very large (p in the hundreds or more) it is quite difficult to devise an algorithm that convincingly explores the model space. Nevertheless, for the problem analyzed here, with n << p, we have argued that we know what happens in M_S, a huge subset of the whole model space. The challenge is then how to use that information to produce reliable results. In the previous section we saw that what is essentially needed is an estimate of P_S and of the q_i^R, and both quantities can be estimated with methods that efficiently explore M_R (still a moderate to large model space). We put in practice the following scheme:

1. Use the Gibbs sampling algorithm studied and recommended in Garcia-Donato and Martinez-Beneito (2013) to obtain two samples of size N from the posterior distribution over M_R (i.e. the posterior based on the prior in (4) if M_γ ∈ M_R and zero otherwise).

2. Check that convergence has been achieved (discarding burn-in samples if needed) and use eq. (35) in George and McCulloch (1997) to estimate the normalizing constant C(n, p) (August 15: despite the notation, this C is not a normalizing constant). Use it, in combination with (7), to estimate P_S. At this point you know who wins.

3. Combine both samples to estimate q_i^R, and use formula (8) together with P_S to compute an estimate of q_i.

All the steps described above can be carried out with a small p > n modification of the R package BayesVarSel by Garcia-Donato and Forte (2012), available upon request from the authors.
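The following self-contained sketch illustrates the flavour of step 1, a Gibbs sampler over the regular models M_R. To keep it short, the Bayes factor used is that of a g-prior with g = n (i.e. a degenerate mixing density p_n), not the Robust prior actually used in the paper, and the data, dimensions and run length are purely illustrative; the real analyses were carried out with the modified BayesVarSel package mentioned above.

```r
set.seed(4)
n <- 41; p <- 100; k0 <- 1
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(1.3 * X[, 1] - 1.2 * X[, 3] + rnorm(n, sd = sqrt(0.5)))   # toy sparse truth

SSE0 <- sum((y - mean(y))^2)
log.post <- function(gamma) {                    # log of B_gamma * P(M_gamma), up to a constant
  k <- sum(gamma)
  if (k + k0 >= n) return(-Inf)                  # only regular models receive positive mass here
  Q <- if (k == 0) 1 else
    sum(residuals(lm(y ~ X[, gamma == 1, drop = FALSE]))^2) / SSE0
  ((n - k - k0) / 2) * log(1 + n) - ((n - k0) / 2) * log(1 + n * Q) - lchoose(p, k)
}

gamma <- rep(0L, p)                              # start from the null model
N <- 200                                         # very short run, for illustration only
incl <- numeric(p)
for (it in seq_len(N)) {
  for (j in seq_len(p)) {                        # one full sweep of component-wise updates
    g1 <- gamma; g1[j] <- 1L
    g0 <- gamma; g0[j] <- 0L
    gamma[j] <- rbinom(1, 1, 1 / (1 + exp(log.post(g0) - log.post(g1))))
  }
  incl <- incl + gamma
}
head(sort(incl / N, decreasing = TRUE), 5)       # rough estimates of the largest q_i^R
```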

We illustrate the methodology using the simulation study based on a real dataset in Hans et al. (2007). These data consist of a gene expression dataset from a survival study in brain cancer (add reference), with n patients and p = 8408 genes from a tumor specimen. Exactly as in Hans et al. (2007), we define the true data generating model as

y_i = 1.3 x_{i1} + 0.3 x_{i2} − 1.2 x_{i3} − 0.5 x_{i4} + ε_i,    (9)

where ε_i ~ N(0, 0.5), and from it we simulated one dataset with n = 41 observations. As described in that paper, these four covariates were chosen in part because of previous information about these genes and also because they exhibit some correlation with other genes in the dataset. (Important: this problem does not have an intercept and the formulas have to be rewritten for this situation. The results presented here already take this into account.)

We ran the algorithm for N iterations, of which the first 1000 were discarded. Results are summarized in Table 2. The first observation is that the posterior probability of the singular subset is very small (0.004), so the posterior distribution essentially concentrates on the regular part. The results are quite informative: two of the four true covariates (x_1 and x_3) have a large inclusion probability, and none of the 8404 spurious covariates has a non-negligible probability (the upper bound was 0.038). The variable x_2 has a small inclusion probability (0.154), but that is at least four times the inclusion probability of any spurious covariate.

[Table 2 (columns: n, P_S, q_1, q_2, q_3, q_4, q̄_T, q^U_T, HPM): Hans et al. (2007) dataset. Keys: q_i is the inclusion probability of x_i (i = 1, 2, 3, 4); q̄_T and q^U_T are, respectively, the mean and the maximum of the inclusion probabilities of the spurious variables; HPM is the estimated most probable a posteriori model.]

[Table 3 (columns: n, P_S, q_1, q_2, q_3, q_4, q̄, q^U_T; rows for n = 41 labelled beta(1, f(p)), beta(1, 99) and beta(1, 9)).]

Interestingly, the highest posterior probability model (HPM), {x_1, x_3}, gives extra evidence about the importance of these covariates in the experiment. Finally, to analyze the impact of n relative to p, we repeated the experiment with the first 10, 20 and 30 observations. Results are also included in Table 2. There we can clearly see how the informative content of the data is overwhelmed by a large p when n is small. In the extreme case n = 10, P_S is so large that the posterior and the prior distributions basically coincide. When n = 30, P_S is smaller and the information in the data starts being relevant; what we see there is the parsimony of the Bayesian approach: all the variables have small inclusion probabilities. Still, the HPM points to the importance of {x_1, x_3}. Also, in Figure 1 we represent the posterior distribution of the dimension of the true model for the different values of n. What do we learn?

[Figure 1: Posterior probabilities of the dimension of the true model for the different values of n (only the first 60 dimensions are represented on the x-axis).]

Although the Lasso's results and ours are hardly comparable (one depends on a penalty parameter, the other summarises its results in terms of posterior probabilities, ...), the popularity of the Lasso for variable selection in the n << p setting makes a comparison with our proposal worthwhile. Namely, we have run the Lasso on our dataset using glmnet, the R package by Friedman et al. (2010). Figure 2 shows the Lasso fit for the whole dataset (n = 41 observations). Note that x_4 does not enter the Lasso fit for any value of λ, while x_3 is included in the model for a few values of log-λ around 0 and, after being removed, is included again in models with lower values of log-λ (lower than −1). The value of λ to be used in the Lasso is often chosen by cross-validation.

Two criteria are particularly popular for setting it: the value yielding the minimum cross-validated error (MSE in our case), which we call λ_min, and the one-standard-error rule, i.e. the value of λ yielding the simplest model less than one standard error away from λ_min. We call this last criterion λ_1se; it is intended to be a parsimonious alternative to λ_min. The lower side of Figure 2 shows the estimated values of λ_min and λ_1se. Since these values are chosen by cross-validation, they vary across different runs of the cross-validation; we have therefore plotted, for each of them, a segment covering the central 80% of the values obtained in 100 different cross-validations, together with the median value over all those runs. According to λ_min, x_1, x_2 and x_3 are selected plus 17 to 38 spurious variables, while λ_1se selects models ranging from 1 to 13 variables (the median value of λ_1se would select {x_2, x_3780, x_4494}), so the results derived from λ_min seem more sensible than those derived from λ_1se.

[Figure 2: Lasso applied to the example: as a function of (log-)λ, the penalisation parameter on the l1-norm of the coefficients, the estimated values of the coefficients of the variables included in the model. The upper horizontal axis shows, for some values of (log-)λ, the number of variables included in the model. Black lines correspond to the coefficients of the true explanatory variables (x_1 to x_4), while gray lines stand for the spurious variables included in the model for the different values of λ.]
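For reference, the Lasso comparison can be reproduced along the following lines with glmnet; here X and y stand for the n = 41 by p = 8408 gene-expression design matrix and the simulated response of Section 6, which are not included here.

```r
library(glmnet)
fit <- glmnet(X, y)                                # whole Lasso path over a grid of lambda values
plot(fit, xvar = "lambda", label = TRUE)           # coefficient profiles, as in Figure 2

cv <- cv.glmnet(X, y)                              # 10-fold cross-validation of the MSE
c(lambda.min = cv$lambda.min, lambda.1se = cv$lambda.1se)
sel.min <- which(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)   # variables chosen by lambda_min
sel.1se <- which(as.numeric(coef(cv, s = "lambda.1se"))[-1] != 0)   # variables chosen by the 1-se rule
length(sel.min); length(sel.1se)
# repeating cv.glmnet() over many random fold assignments reproduces the variability
# of lambda_min and lambda_1se summarised by the segments in Figure 2
```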

References

M. Baragatti and D. Pommeret. A study of variable selection using g-prior distribution with ridge parameter. Computational Statistics and Data Analysis, 56, 2012.
Maria J. Bayarri, James O. Berger, Anabel Forte, and Gonzalo García-Donato. Criteria for Bayesian model choice with application to variable selection. Annals of Statistics, 40, 2012.
James O. Berger and Luis R. Pericchi. Objective Bayesian methods for model selection: introduction and comparison. Lecture Notes-Monograph Series, 38, 2001.
Carmen Fernández, Eduardo Ley, and Mark F. Steel. Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100, 2001.
Jerome H. Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 2010.
Gonzalo Garcia-Donato and Anabel Forte. BayesVarSel: Bayesian Variable Selection in Linear Models. R package version 1.0, 2012.
Gonzalo Garcia-Donato and Miguel A. Martinez-Beneito. On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association, 108(501), 2013.
Edward I. George and Robert E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7, 1997.
Mayetri Gupta and Joseph G. Ibrahim. Variable selection in regression mixture modeling for the discovery of gene regulatory networks. Journal of the American Statistical Association, 102, 2007.
Chris Hans. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing, 20, 2010.
Chris Hans, Adrian Dobra, and Mike West. Shotgun stochastic search for regression with many candidate predictors. Journal of the American Statistical Association, 102, 2007.
Harold Jeffreys. Theory of Probability. Oxford University Press, 3rd edition, 1961.

Iain Johnstone and Michael Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A, 367, 2009.
Feng Liang, Rui Paulo, German Molina, Merlise A. Clyde, and James O. Berger. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 2008.
Anastasia Lykou and Ioannis Ntzoufras. On Bayesian lasso variable selection and the specification of the shrinkage parameter. Statistics and Computing, 23, 2013.
Yuzo Maruyama and Edward I. George. gBF: a fully Bayes factor with a generalized g-prior. arXiv preprint, v2, 2012.
Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103, 2008.
James G. Scott and James O. Berger. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7), 2006.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 1996.
Mike West. Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics 7. Oxford University Press, 2002.
Arnold Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In A. Zellner, editor, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Edward Elgar Publishing Limited, 1986.
Arnold Zellner and A. Siow. Basic Issues in Econometrics. Chicago: University of Chicago Press, 1984.


More information

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models D. Fouskakis, I. Ntzoufras and D. Draper December 1, 01 Summary: In the context of the expected-posterior prior (EPP) approach

More information

Variable Selection in Data Mining Project

Variable Selection in Data Mining Project Variable Selection Variable Selection in Data Mining Project Gilles Godbout IFT 6266 - Algorithmes d Apprentissage Session Project Dept. Informatique et Recherche Opérationnelle Université de Montréal

More information

Finite Population Estimators in Stochastic Search Variable Selection

Finite Population Estimators in Stochastic Search Variable Selection 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (2011), xx, x, pp. 1 8 C 2007 Biometrika Trust Printed

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and Athens Journal of Sciences December 2014 Discriminant Analysis with High Dimensional von Mises - Fisher Distributions By Mario Romanazzi This paper extends previous work in discriminant analysis with von

More information

Studentization and Prediction in a Multivariate Normal Setting

Studentization and Prediction in a Multivariate Normal Setting Studentization and Prediction in a Multivariate Normal Setting Morris L. Eaton University of Minnesota School of Statistics 33 Ford Hall 4 Church Street S.E. Minneapolis, MN 55455 USA eaton@stat.umn.edu

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Bayesian Model Comparison

Bayesian Model Comparison BS2 Statistical Inference, Lecture 11, Hilary Term 2009 February 26, 2009 Basic result An accurate approximation Asymptotic posterior distribution An integral of form I = b a e λg(y) h(y) dy where h(y)

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Sean Borman and Robert L. Stevenson Department of Electrical Engineering, University of Notre Dame Notre Dame,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001

Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001 Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001 This is a nice paper that summarizes some of the recent research on Bayesian

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES

THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES Submitted to the Annals of Statistics THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES By Maria M. Barbieri, James O. Berger Edward I. George and, Veronika Ročková,, 17 August 2018 Università Roma

More information