Bayesian variable selection in high dimensional problems without assumptions on prior model probabilities


arXiv: v1 [stat.ME] 11 Jul 2016

Bayesian variable selection in high dimensional problems without assumptions on prior model probabilities

J. O. Berger (Duke University), G. García-Donato (Universidad de Castilla-La Mancha), M. A. Martínez-Beneito (FISABIO, Valencia) and V. Peña (Duke University)

July 12, 2016

Abstract

We consider the problem of variable selection in linear models when p, the number of potential regressors, may exceed (perhaps substantially) the sample size n (which is possibly small).

1 Introduction and notation

In model selection problems the uncertainty about which model has generated the data is explicitly taken into account. Variable selection is a particular model selection problem in which all models share a common functional form but differ in the explanatory variables they contain. We consider the problem of variable selection in linear models when p, the number of potential regressors, may exceed (perhaps substantially) the sample size n (which is possibly small). See West (2002) and Johnstone and Titterington (2009) for excellent introductions to the topic.

Let y be a sample of n observations of the response variable and let X be the n × p design matrix whose columns contain the potential explanatory variables. As is common in the literature, we compactly index the set of all candidate models M_γ ∈ M by a binary vector γ^t = (γ_1, ..., γ_p).

Each γ_i is zero or one, indicating whether or not the i-th covariate is included in M_γ. Hence M = {M_γ : γ ∈ {0,1}^p}, where

M_γ : y = α 1_n + X_γ β_γ + ε,   ε ~ N(0, σ² I_n),

X_γ is the n × k_γ sub-matrix of X whose columns correspond to the ones in γ, and β_γ is the associated regression parameter of dimension k_γ = Σ_{i=1}^p γ_i. We assume that if n > k_γ then rank(1_n, X_γ) = k_γ + 1, and that if n ≤ k_γ then rank(X_γ) = n. Finally, denote by V_γ the matrix X_γ with its columns centered on their means, that is, V_γ = (I − P_n) X_γ, where P_n = 1_n 1_n^t / n is the orthogonal projection onto the space spanned by the intercept.

We denote by M_0 the null model (γ = 0), which has k_0 = 1 regressor (just the intercept). The problem in which the null model contains no regressors (k_0 = 0) is very similar and will also be considered throughout the paper. In that case it is assumed that if n ≥ k_γ then rank(X_γ) = k_γ and if n < k_γ then rank(X_γ) = n, and further V_γ = X_γ. In order not to duplicate all the formulas, and at the price of a slight abuse of notation, in what follows it should be understood that the parameter α does not exist when k_0 = 0.
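As a concrete illustration of this notation, the following R sketch builds X_γ and the centered matrix V_γ from a binary vector γ; the data, the chosen γ and all object names are illustrative and do not come from the authors' code.

```r
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)                     # n x p matrix of potential regressors
gamma <- as.integer(seq_len(p) %in% c(2, 5, 9))     # binary vector selecting covariates 2, 5 and 9
k.gamma <- sum(gamma)                               # k_gamma, the dimension of M_gamma

X.gamma <- X[, gamma == 1, drop = FALSE]            # n x k_gamma sub-matrix of X
P.n <- matrix(1 / n, n, n)                          # projection onto the intercept, 1_n 1_n^t / n
V.gamma <- (diag(n) - P.n) %*% X.gamma              # column-centered covariates, V_gamma
qr(V.gamma)$rank                                    # equals k_gamma here, since n > k_gamma + 1
```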

The formal Bayesian answer to the model selection problem is based on the posterior distribution over the model space, f(γ | y) ∝ B_γ f(γ), where f(γ) is the prior probability of M_γ and B_γ is the Bayes factor of M_γ to a fixed model, here taken to be M_0. B_γ is the ratio of integrated likelihoods m_γ(y)/m_0(y), where

m_γ(y) = ∫ M_γ(y | β_γ, α, σ) π_γ(β_γ, α, σ) dβ_γ dα dσ,

and π_γ is the prior distribution, a quite delicate aspect of the Bayesian approach. Most of the popular model selection priors in this context, like g-priors, Zellner-Siow priors, hyper-g priors, etc., share a similar functional form. This family of priors, which we call conventional priors, has been studied in depth by Bayarri et al. (2012), who showed that they have many appealing theoretical properties. In this paper we propose an extension of the conventional priors that covers the situation with more potential regressors than data points and that has the original conventional priors as a particular case. We call these priors regularized conventional priors. Our extension has important connections with other proposals in the literature like... We introduce the main motivating ideas in Section 2. The rest of the paper is organized as follows.

2 Conventional priors and motivating ideas

In this work we adopt the term conventional (used by Berger and Pericchi, 2001) to refer to a large family of priors that are extremely popular in the literature. In the standard scenario with more data points than potential regressors (n ≥ p + k_0), these priors are of the form π_γ(β_γ, α, σ) = σ^{-1} π_γ(β_γ | α, σ), with the conditional distribution being an elliptical density of the type

π_γ(β_γ | α, σ) = ∫_0^∞ N_{k_γ}(β_γ | 0, t S_γ) p_n(t) dt,    (1)

where S_γ = σ² [V_γ^t V_γ]^{-1} is the sampling variance matrix of the maximum likelihood estimator of β_γ and p_n(t) is a proper density that acts as a mixing density. The presence of P_n in S_γ (through the centered matrix V_γ) has traditionally been justified as making it sensible to use the same improper prior π(α, σ) = σ^{-1} for the common parameters, because in this parameterization the common parameters are orthogonal to the model-specific parameters β_γ (in the Fisher information sense). Nevertheless, Bayarri et al. (2012) have shown that using π(α, σ) = σ^{-1} can in fact be formally justified with invariance and predictive matching arguments, and that orthogonality does not play any role. They have also shown that the use of P_n in the definition of the prior scale is related to null predictive matching, providing a characterization of conventional priors. Interestingly, this criterion has important implications in this study, as we will see.

The conventional family contains, through particular choices of p_n, very popular priors such as the Zellner-Siow priors (Zellner and Siow, 1984), the g-priors (Zellner, 1986; Fernández et al., 2001), the hyper-g priors (Liang et al., 2008) and the robust priors (Bayarri et al., 2012), just to mention some. Recently, Bayarri et al. (2012) have shown that conventional priors have optimal properties in the sense that they satisfy several formal criteria. In particular, they showed that, irrespective of p_n, conventional priors are measurement and group invariant, and exact, dimensional and null predictive matching.
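A draw from the conventional prior (1) can be obtained by first drawing the mixing parameter t and then the conditional normal. The sketch below does this in R for the objects of the previous snippet, taking σ = 1 and, as an assumed choice of p_n, the inverse-gamma(1/2, n/2) density that yields the Zellner-Siow prior; any other proper mixing density could be substituted.

```r
library(MASS)                                       # for mvrnorm()
sigma <- 1
S.gamma <- sigma^2 * solve(crossprod(V.gamma))      # conventional scale matrix S_gamma
draw.conventional <- function(ndraws) {
  t(sapply(seq_len(ndraws), function(i) {
    t.mix <- 1 / rgamma(1, shape = 1 / 2, rate = n / 2)  # t ~ inverse-gamma(1/2, n/2)
    mvrnorm(1, mu = rep(0, k.gamma), Sigma = t.mix * S.gamma)
  }))
}
beta.draws <- draw.conventional(1000)               # 1000 draws from pi_gamma(beta_gamma | alpha, sigma)
```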

It is also very convenient that conventional priors lead to a simple expression for the Bayes factors:

B_γ = ∫_0^∞ (1 + t Q_γ)^{-(n-k_0)/2} (1 + t)^{(n-k_γ-k_0)/2} p_n(t) dt,    (2)

where Q_γ = SSE_γ / SSE_0 is the ratio of the sums of squared errors of M_γ and M_0. One appealing characteristic of the Robust prior proposed in Bayarri et al. (2012) is that the above integral can be expressed in closed form using a hypergeometric function. An alternative formula for the Bayes factor is

B_γ = ∫_0^∞ (1 + t n Q_γ)^{-(n-k_0)/2} (1 + t n)^{(n-k_γ-k_0)/2} p*_n(t) dt,    (3)

where p*_n(t) = p_n(t n) n.

As a direct consequence of the next result, the matrix S_γ is defined only when n ≥ k_γ + k_0.

Result 1. The rank of the n × k_γ matrix V_γ is n − k_0 if n < k_γ + k_0, and k_γ if n ≥ k_γ + k_0.

Proof. The case k_0 = 0 is a straightforward consequence of the assumptions about the rank of X_γ. [Show the case k_0 = 1.]
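The integral in (2) is straightforward to evaluate numerically for a regular model. The following sketch does so with R's integrate(), again with the Zellner-Siow mixing density as an assumed p_n and a simulated response; the Robust prior used later in the paper would instead give the closed-form hypergeometric expression.

```r
set.seed(2)
y <- as.numeric(1 + X.gamma %*% rnorm(k.gamma) + rnorm(n))   # illustrative response
k0 <- 1                                                       # null model = intercept only
SSE.gamma <- sum(residuals(lm(y ~ X.gamma))^2)
SSE.0     <- sum(residuals(lm(y ~ 1))^2)
Q.gamma   <- SSE.gamma / SSE.0                                # ratio of sums of squared errors

p.n <- function(t)                                            # inverse-gamma(1/2, n/2) density
  ifelse(t > 0, dgamma(1 / t, shape = 1 / 2, rate = n / 2) / t^2, 0)
integrand <- function(t)
  (1 + t * Q.gamma)^(-(n - k0) / 2) * (1 + t)^((n - k.gamma - k0) / 2) * p.n(t)
B.gamma <- integrate(integrand, lower = 0, upper = Inf)$value # Bayes factor (2) of M_gamma to M_0
B.gamma
```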

The implication is that, when n < p + k_0, conventional priors are not defined for all competing models, since models M_γ with k_γ + k_0 > n would have an undefined prior scale matrix. Models in the model space can then be classified as regular (when k_γ + k_0 < n), saturated (when k_γ + k_0 = n) and singular (when k_γ + k_0 > n).

In part because of the problem described above, the development of Bayesian methods when M contains singular models has been inspired by sources of motivation other than the conventional tradition. Prominent among these approaches are those based on regularization methods like the Lasso (least absolute shrinkage and selection operator) introduced by Tibshirani (1996) and its Bayesian counterparts (see e.g. Park and Casella, 2008), which use a Laplace prior. The most appealing feature of the Lasso is sparsity, meaning that when it is applied to the full model (γ = 1) the estimates of certain regression parameters are exactly zero. Hence the Lasso undoubtedly induces a type of variable selection, but it is not a formal model selection procedure, since the most complex model is implicitly assumed to be the true model (no model uncertainty is considered). In practical terms, the immediate consequence is that there is no way of measuring the uncertainty associated with the model selection exercise. That limitation has been noted by Hans (2010) and Lykou and Ntzoufras (2013), who have embedded the Lasso approach into the formal framework of model selection by adopting a multivariate Laplace prior for the specific regression parameters of each entertained model. Despite their undoubted value and interest, the only justification of these priors is their connection with the Lasso methodology and, to the best of our knowledge, no optimal property has been proved for them. The most distinctive feature of these priors with respect to the conventional priors is not the form of the prior itself (e.g. the Laplace density can be written as a scale mixture of normal distributions) but the independence assumed among the regression parameters. This yields a proper density but, as acknowledged by Lykou and Ntzoufras (2013), the Bayesian Lasso does not account for the structure of the covariates.

A compromise that does take the structure of the covariates into account within a formal model selection framework is given by the ridge-inspired procedures of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012). These authors incorporate dependence among the regression coefficients using an extension of the g-prior that circumvents the singularity of the conventional scale matrix through the introduction of a ridge parameter λ. In particular, they propose using the scale matrix σ² [V_γ^t V_γ + λ I]^{-1} (normally after a transformation of the covariates so that they have unit scale). A drawback of this setting is the specification of the ridge parameter, which in principle may have a strong impact on the results. Another is that the priors used for the regular models are not conventional priors and hence do not share their optimal properties. A more sophisticated and also interesting extension of conventional priors is that of Maruyama and George (2012), which handles the case n < p through a singular value decomposition of X_γ. (For our records: this is not really an extension as it does not have the g-prior as a particular case, hence the optimal properties are not inherited. For instance, is their prior invariant under changes in the units?)

In Section 3 we define a generalization of conventional priors that we call regularized conventional priors. These are proper priors and are based on using, as the conditional scale of β_γ, a non-singular generalized inverse of V_γ^t V_γ. Such matrices equal the inverse of V_γ^t V_γ for regular models, justifying that our proposal generalizes the conventional priors.

The form of the scale matrix has connections with the ridge-based approaches of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012). For singular models, regularized conventional priors are not univocally defined but, within a model, the posterior distribution of any estimable function is unique. This paper is about model selection, and we will show the surprising result that, for singular models, the associated Bayes factor is one. As we will see, this can be viewed as an extension of the null predictive matching criterion and is also congruent with full rank factorizations of the problem. The impossibility of distinguishing between singular models M_γ is also implicitly present in other methodologies like the Lasso, where the result can never be one of such models (Rosset and Zhu, 2007) [copy reference at the end!].

[Work more on this paragraph: it should state the important conclusion that what then remains is nothing more than a multiple testing problem.] When p is much larger than n there is still a huge number of regular models, and what arises is a multiple testing problem, so a control for multiplicity is called for. The discussion and arguments in Scott and Berger (2006) point out that the proper Bayesian way of handling multiplicity issues is through the prior distribution over the model space. We adopt here their recommendation of using the prior

P(M_γ) = [ (p choose k_γ) (p + 1) ]^{-1},    (4)

which penalizes models belonging to dimensions containing a large number of models, where it is precisely more likely that false signals appear.
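The effect of the prior (4) is easy to see numerically: every dimension receives the same total mass 1/(p + 1), so a single model in a crowded dimension gets much less prior probability than a single model in a sparse one. A small R illustration (with p = 8 purely for readability):

```r
p <- 8
k <- 0:p
models.per.dim  <- choose(p, k)                      # number of models of each dimension
prior.per.model <- 1 / (choose(p, k) * (p + 1))      # P(M_gamma) in (4) for a model of dimension k
mass.per.dim    <- models.per.dim * prior.per.model  # total mass of each dimension: always 1/(p + 1)
round(cbind(k, models.per.dim, prior.per.model, mass.per.dim), 5)
```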

The rest of the paper is organized as follows.

3 Regularized conventional priors

The non-singularity of the scale matrix in the ridge-based proposals of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012) is due to the addition of the diagonally dominant matrix λI, which intuitively results in a substantial modification of the original scale matrix. This modification seems unneeded for regular models, where S_γ is non-singular, and here we study alternative ways of defining the scale matrix. In the following definition R(·) denotes the subspace spanned by the rows of the matrix in its argument.

Definition 1. For M_γ, with parameters (α, β_γ, σ²), the regularized conventional prior for β_γ | α, σ² is any proper prior of the form

π_γ(β_γ | α, σ) = ∫_0^∞ N_{k_γ}(β_γ | 0, t S*_γ) p_n(t) dt,    (5)

where p_n(t) is a mixing density, S*_γ = σ² [V_γ^t V_γ + T_γ]^{-1}, and T_γ is any symmetric positive semi-definite matrix such that R(V_γ^t V_γ) ⊕ R(T_γ) = R^{k_γ}.

Note that these priors extend the conventional priors since, for regular models, R(V_γ^t V_γ) = R^{k_γ} and by definition T_γ is the zero matrix. Further, for singular models these priors are proper and hence valid for model selection. Finally, note that such priors exist, since we can always take T_γ = C_γ^t C_γ, where C_γ : (k_γ − n + k_0) × k_γ is of full rank and has rows that are linearly independent of the rows of V_γ. In this way, for singular models, the regularized conventional priors can be seen as conventional priors with respect to the extended design matrix obtained by stacking V_γ on top of C_γ, in which the missing observations of the covariates (the k_γ − n + k_0 additional cases needed to complete the rank) are replaced by the imputed values in C_γ.

Of course, the regularized conventional priors depend on T_γ (or on the imputed values C_γ); nevertheless, the dependence happens in such a way that the corresponding prior does not influence the likelihood. The reason why this is the case is contained in the following important result.

Result 2. Let T_γ be a matrix defined as in (5). Then the regular matrix [V_γ^t V_γ + T_γ]^{-1} is a generalized inverse of V_γ^t V_γ.

Proof. An immediate consequence of a theorem in Harville (1997, page 421) and the fact that [V_γ^t V_γ + T_γ]^{-1} is a generalized inverse (because it is an inverse) of [V_γ^t V_γ + T_γ].

The above result provides a very appealing interpretation of regularized conventional priors: the matrix (V_γ^t V_γ)^{-1}, which is not defined for all models, is replaced by a regular generalized inverse (V_γ^t V_γ)^-. Said another way, we can keep the desired scale (the one based on V_γ^t V_γ) while still using a regular matrix, hence defining a valid model selection prior.
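The construction in Definition 1, and the generalized-inverse property of Result 2, can be checked numerically. The sketch below builds a hypothetical singular model (n = 10, k_γ = 15, k_0 = 1), completes V_γ^t V_γ with T_γ = C_γ^t C_γ for a randomly generated C_γ (whose rows are, with probability one, linearly independent of the rows of V_γ), and verifies that the resulting regular matrix is a generalized inverse of V_γ^t V_γ.

```r
set.seed(3)
n <- 10; k.gamma <- 15
X.gamma <- matrix(rnorm(n * k.gamma), n, k.gamma)
V <- scale(X.gamma, center = TRUE, scale = FALSE)          # V_gamma; rank n - 1 by Result 1
C.gamma <- matrix(rnorm((k.gamma - n + 1) * k.gamma),      # (k_gamma - n + k0) x k_gamma rows, k0 = 1
                  k.gamma - n + 1, k.gamma)
T.gamma <- crossprod(C.gamma)                              # T_gamma = C_gamma^t C_gamma
A <- crossprod(V)                                          # V_gamma^t V_gamma, singular
G <- solve(A + T.gamma)                                    # regular matrix used in the prior scale
max(abs(A %*% G %*% A - A))                                # ~ 0: G is a generalized inverse of A
```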

The regularized conventional priors are not unique, as there are many different ways of defining the matrix T_γ. Nevertheless, a crucial property of these priors is invariance with respect to that choice when estimating, within a given model, a parameter that is univocally informed by the data (what are usually called estimable functions; see Harville 19??). We take this as evidence that regularized conventional priors are sensible priors and hence satisfy the principle in Berger and Pericchi (2001) about model selection priors.

Result 3. Let M_γ be any singular model and let θ be an estimable function. Then the posterior distribution of θ, given that M_γ is the true model, does not depend on the choice of the matrix T_γ.

Proof. (Removing the subindex γ for simplicity.) First show that

β | α, σ, t, y ~ N_k(m, Σ),

where

m = [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} V^t y   and   Σ = σ² [V^t V (1 + t^{-1}) + T t^{-1}]^{-1}.

Secondly, show that [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} is a generalized inverse of V^t V (1 + t^{-1}). It is well known that the matrix product V [V^t V (1 + t^{-1})]^- V^t does not depend on the choice of generalized inverse, implying that in our case

H = V [V^t V (1 + t^{-1}) + T t^{-1}]^{-1} V^t

in particular does not depend on T. Finally, notice that estimable functions are of the type θ = ℓ^t V β, where ℓ is any vector of conformable dimension, with posterior distribution

θ | α, σ, t, y ~ N(ℓ^t H y, σ² ℓ^t H ℓ),

which is independent of T.

(Perhaps show that the lasso and ridge-based approaches do depend on these choices?)
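Result 3 can also be checked numerically: for the hypothetical singular model of the previous sketch, two different completion matrices C give exactly the same matrix H, and therefore the same posterior for any estimable function.

```r
t.mix <- 2.5                                               # any fixed value of the mixing parameter t
H.of <- function(Cmat) {                                   # H = V [V^t V (1 + 1/t) + T/t]^{-1} V^t
  Tmat <- crossprod(Cmat)
  V %*% solve(crossprod(V) * (1 + 1 / t.mix) + Tmat / t.mix) %*% t(V)
}
C1 <- matrix(rnorm((k.gamma - n + 1) * k.gamma), k.gamma - n + 1, k.gamma)
C2 <- matrix(rnorm((k.gamma - n + 1) * k.gamma), k.gamma - n + 1, k.gamma)
max(abs(H.of(C1) - H.of(C2)))                              # ~ 0: H does not depend on the choice of T
```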

Our aim, however, is model selection. In the next result we show that the resulting Bayes factor of any singular model to the null is one.

Result 4. Let M_γ be a singular model. Then B_γ = 1, independently of the choice of T_γ and of p_n(·).

Proof. (Removing the subindex γ for simplicity.) According to Result 1, if M is singular then rank(V) = n − k_0, and this matrix admits a full rank factorization of the type V = L R, where L : n × (n − k_0) and R : (n − k_0) × k are of full rank. Now, in computing the integral m_γ(y), apply the change of variables β_R = R β and β_C = C β (here C : (k − n + k_0) × k is of full rank and such that T = C^t C) to show that

m_γ(y) = m_0(y) det(L^t (I − P_n) L)^{-1/2} det(V^t V + T)^{1/2} det((R ; C))^{-1},

where (R ; C) denotes the matrix obtained by stacking R on top of C (this is mainly due to the null predictive matching property of conventional priors). Now show that the factor multiplying m_0(y) above is 1.

4 Unitary Bayes factors

The main consequence of adopting the regularized conventional priors in a variable selection problem containing singular models is that B_γ = 1 for singular models, while for regular models B_γ is a standard conventional Bayes factor. There are a number of independent arguments that also support unitary Bayes factors for singular models.

Null predictive matching. This is one of the criteria proposed by Bayarri et al. (2012) to construct sensible objective prior distributions, and it reflects the idea, going back to Jeffreys (1961), that data of minimal size for a given model should not allow one to distinguish between that model and the null. In the regular case this implies that, when n = k_γ + k_0 (saturated models), the Bayes factor of M_γ to the null must be one. (Indeed, for a saturated model SSE_γ = 0, so Q_γ = 0 and n − k_γ − k_0 = 0, and (2) reduces to ∫ p_n(t) dt = 1.) Interestingly, as highlighted by Bayarri et al. (2012), null predictive matching provides a characterization of the scale S_γ, since no other matrix (not even a multiple of S_γ) can achieve this predictive matching. This was taken by Bayarri et al. (2012) as positive evidence in favour of conventional priors (the ridge-based proposals do not have this property). The situation with n < k_γ + k_0 is an extreme case of data of minimal size (having more parameters than observations was explicitly mentioned by Jeffreys, 1961; check!) in the sense that, as in the saturated case, there are only enough data to estimate the estimable functions in M_γ, but not to distinguish it from the null.

Reparameterization. Consider first the following result:

Result 5. Any singular model M_γ admits a reparameterization as a saturated model M̃_γ (i.e. with k̃_γ = n − k_0).

Proof. [Revise to cover the case k_0 = 0.] Without loss of generality, any model M_γ can be reparameterized as

M_γ : y = α 1_n + V_γ β_γ + ε,   ε ~ N(0, σ² I_n),

where V_γ = (I − P_n) X_γ. According to Result 1, if M_γ is singular then rank(V_γ) = n − 1, and this matrix admits a full rank factorization of the type V_γ = L_γ R_γ, where L_γ : n × (n − 1) is of full column rank and R_γ : (n − 1) × k_γ is of full row rank. Hence we can parameterize M_γ as

M̃_γ : y = α 1_n + L_γ β̃_γ + ε,   ε ~ N(0, σ² I_n),    (6)

where β̃_γ = R_γ β_γ is now (n − 1)-dimensional. [Show that M̃_γ can be constructed in such a way that it is a saturated (original) model.]

According to the result above, singular models M_γ have an equivalent representation as saturated models (which, recall, have a conventional Bayes factor to the null equal to one). This coincidence does not depend on the arbitrary choice of the matrices L and R used to construct the reparameterization. Notice that a prior distribution without the null predictive matching property (outside the conventional family) could easily lead to a Bayes factor that depends on these matrices.

In what follows we use the conventional Robust prior of Bayarri et al. (2012) because it is our preferred prior, but everything applies to the whole family of conventional priors.
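A full rank factorization like the one used in Result 5 can be computed, for instance, from the singular value decomposition. Continuing with the hypothetical singular model of the earlier sketch (n = 10, k_γ = 15, k_0 = 1):

```r
sv <- svd(V)
r  <- sum(sv$d > 1e-8)                                     # numerical rank of V_gamma, here n - 1
L  <- sv$u[, 1:r] %*% diag(sv$d[1:r])                      # n x (n - 1), full column rank
R  <- t(sv$v[, 1:r])                                       # (n - 1) x k_gamma, full row rank
max(abs(L %*% R - V))                                      # ~ 0: V_gamma = L R
dim(L)                                                     # the reparameterized model (6) has n - 1 regressors
```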

5 The posterior distribution when p >> n

Up to here, the conclusion is that there is no informative content (coming from the data) in singular models (of course, this does not imply that they do not influence the posterior distribution). Since p is expected to be much bigger than n, this implies that there is information in only a very small number of models (in the example in Section 6, with p = 8408 and n = 41, the proportion of regular models over the total number of models is of a negligible order of magnitude). The obvious and crucial question that arises is whether we can learn something or not.

In what follows we denote by M_S the set of singular or saturated models (these share a unitary Bayes factor, which makes it convenient to group them in a common set) and by M_R the rest (formed by the regular models). See Table 1.

  Dimension          Type        Subset of M
  k_γ + k_0 < n      Regular     M_R
  k_γ + k_0 = n      Saturated   M_S
  k_γ + k_0 > n      Singular    M_S

  Table 1: Types of models.

Application of Bayes' theorem shows that the posterior distribution can be expressed as a weighted average over the subsets of M defined above, with weights given by their posterior probabilities, which we denote (here M_T represents the true model) P_S = Pr(M_T ∈ M_S | y) and 1 − P_S. Notice that

P_S = Σ_{M_γ ∈ M_S} B_γ Pr(M_γ) / [ Σ_{M_γ ∈ M_S} B_γ Pr(M_γ) + Σ_{M_γ ∈ M_R} B_γ Pr(M_γ) ],

and then, because B_γ = 1 for M_γ ∈ M_S,

P_S = Pr(M_S) / [ Pr(M_S) + Pr(M_R) C(n, p) ],

where C(n, p) is the normalizing constant conditional on M_T ∈ M_R, that is,

C(n, p) = Σ_{M_γ ∈ M_R} B_γ Pr(M_γ | M_T ∈ M_R).

For the particular case of the prior in (4), notice that, of the p + 1 different dimensions, n − k_0 correspond to M_R and the rest, p − n + k_0 + 1, belong to M_S, so that Pr(M_S) = (p − n + k_0 + 1)/(p + 1) and hence

P_S = (p − n + k_0 + 1) / [ (p − n + k_0 + 1) + (n − k_0) C(n, p) ].    (7)

Now, any summary of the posterior distribution can be constructed as a weighted average. One popular such summary is the inclusion probability. The posterior inclusion probability of x_i is

q_i = Pr(x_i ∈ M_T | y) = q_i^R (1 − P_S) + q_i^S P_S,    (8)

where q_i^R denotes the inclusion probability conditional on M_T ∈ M_R, with identical notation for q_i^S. For the case of the prior in (4) it can be seen that

q_i^S = (1/2) [ p(p + 1) − (n − k_0)(n − k_0 − 1) ] / [ p (p − n + k_0 + 1) ],

which tends to 1/2 when p grows and n is either constant or grows at a rate n ∼ p^a, for some 0 < a < 1, or n ∼ log(p) (here ∼ means asymptotic equivalence). When n grows linearly with p, say n ∼ f p for some 0 < f < 1, then q_i^S tends to (1 − f²)/2. Hence, if p is large enough, the informative content in q_i depends on the magnitude of P_S.
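In R, formulas (7) and (8) amount to a couple of lines once estimates of C(n, p) and of the regular-part inclusion probabilities q_i^R are available; in the scheme described in the next section these come from the Gibbs sampling output, while here they are simply hypothetical placeholder values used with the dimensions of the example of Section 6 (p = 8408, n = 41, k_0 = 1).

```r
p <- 8408; n <- 41; k0 <- 1
C.np <- 5e4                 # hypothetical estimate of C(n, p); in practice obtained from the Gibbs output
q.R  <- 0.90                # hypothetical estimate of q_i^R for some covariate x_i

P.S <- (p - n + k0 + 1) / ((p - n + k0 + 1) + (n - k0) * C.np)                  # formula (7)
q.S <- (p * (p + 1) - (n - k0) * (n - k0 - 1)) / (2 * p * (p - n + k0 + 1))     # q_i^S under prior (4)
q.i <- q.R * (1 - P.S) + q.S * P.S                                              # formula (8)
c(P.S = P.S, q.S = q.S, q.i = q.i)
```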

6 The methodology in practice and an illustrative application

When M is very large (p in the hundreds or more) it is quite difficult to devise an algorithm that convincingly explores the model space. Nevertheless, for the problem analyzed here, with n << p, we have argued that we know what happens in M_S, a huge subset of the whole model space. The challenge is then how to use that information to produce reliable results. In the previous section we saw that what is essentially needed is an estimate of P_S and of the q_i^R, and both quantities can be estimated with methods that efficiently explore M_R (still a moderate to large model space). We put in practice the following scheme:

1. Use the Gibbs sampling algorithm studied and recommended in Garcia-Donato and Martinez-Beneito (2013) to obtain two samples of size N from the posterior distribution over M_R (i.e. the posterior based on the prior in (4) if M_γ ∈ M_R and zero otherwise).

2. Check that convergence has been achieved (discarding burn-in samples if needed) and use eq. (35) in George and McCulloch (1997) to estimate the normalizing constant C(n, p) (August 15: despite the notation, this C is not a normalizing constant). Use it, in combination with (7), to estimate P_S. At this point you know who wins.

3. Combine both samples to estimate q_i^R, and use formula (8) together with P_S to compute an estimate of q_i.

All the steps described above can be carried out with a small p > n modification of the R package BayesVarSel by Garcia-Donato and Forte (2012), available upon request from the authors.
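The following self-contained sketch illustrates the flavour of step 1, a Gibbs sampler over the regular models M_R. To keep it short, the Bayes factor used is that of a g-prior with g = n (i.e. a degenerate mixing density p_n), not the Robust prior actually used in the paper, and the data, dimensions and run length are purely illustrative; the real analyses were carried out with the modified BayesVarSel package mentioned above.

```r
set.seed(4)
n <- 41; p <- 100; k0 <- 1
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(1.3 * X[, 1] - 1.2 * X[, 3] + rnorm(n, sd = sqrt(0.5)))   # toy sparse truth

SSE0 <- sum((y - mean(y))^2)
log.post <- function(gamma) {                    # log of B_gamma * P(M_gamma), up to a constant
  k <- sum(gamma)
  if (k + k0 >= n) return(-Inf)                  # only regular models receive positive mass here
  Q <- if (k == 0) 1 else
    sum(residuals(lm(y ~ X[, gamma == 1, drop = FALSE]))^2) / SSE0
  ((n - k - k0) / 2) * log(1 + n) - ((n - k0) / 2) * log(1 + n * Q) - lchoose(p, k)
}

gamma <- rep(0L, p)                              # start from the null model
N <- 200                                         # very short run, for illustration only
incl <- numeric(p)
for (it in seq_len(N)) {
  for (j in seq_len(p)) {                        # one full sweep of component-wise updates
    g1 <- gamma; g1[j] <- 1L
    g0 <- gamma; g0[j] <- 0L
    gamma[j] <- rbinom(1, 1, 1 / (1 + exp(log.post(g0) - log.post(g1))))
  }
  incl <- incl + gamma
}
head(sort(incl / N, decreasing = TRUE), 5)       # rough estimates of the largest q_i^R
```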

We illustrate the methodology using the simulation study based on a real dataset in Hans et al. (2007). These data consist of a gene expression dataset from a survival study in brain cancer (add reference), with n patients and p = 8408 genes from a tumor specimen. Exactly as in Hans et al. (2007), we define the true data generating model as

y_i = 1.3 x_{i1} + 0.3 x_{i2} − 1.2 x_{i3} − 0.5 x_{i4} + ε_i,    (9)

where ε_i ~ N(0, 0.5), and from it we simulated one dataset with n = 41 observations. As described in that paper, these four covariates were chosen in part because of previous information about these genes and also because they exhibit some correlation with other genes in the dataset. (Important: this problem does not have an intercept and the formulas have to be rewritten for this situation. The results presented here already take this into account.)

We ran the algorithm for N iterations, of which the first 1000 were discarded. Results are summarized in Table 2. The first observation is that the posterior probability of the singular subset is very small (0.004), so the posterior distribution essentially concentrates on the regular part. The results are quite informative: two of the four true covariates (x_1 and x_3) have a large inclusion probability, and none of the 8404 spurious covariates has a non-negligible probability (the upper bound was 0.038). The variable x_2 has a small inclusion probability (0.154), but that is at least four times the inclusion probability of any spurious covariate.

[Table 2 (columns: n, P_S, q_1, q_2, q_3, q_4, q̄_T, q^U_T, HPM): Hans et al. (2007) dataset. Keys: q_i is the inclusion probability of x_i (i = 1, 2, 3, 4); q̄_T and q^U_T are, respectively, the mean and the maximum of the inclusion probabilities of the spurious variables; HPM is the estimated most probable a posteriori model.]

[Table 3 (columns: n, P_S, q_1, q_2, q_3, q_4, q̄, q^U_T; rows for n = 41 labelled beta(1, f(p)), beta(1, 99) and beta(1, 9)).]

Interestingly, the highest posterior probability model (HPM), {x_1, x_3}, gives extra evidence about the importance of these covariates in the experiment. Finally, to analyze the impact of n relative to p, we repeated the experiment with the first 10, 20 and 30 observations. Results are also included in Table 2. There we can clearly see how the informative content of the data is overwhelmed by a large p when n is small. In the extreme case n = 10, P_S is so large that the posterior and the prior distributions basically coincide. When n = 30, P_S is smaller and the information in the data starts being relevant; what we see there is the parsimony of the Bayesian approach: all the variables have small inclusion probabilities. Still, the HPM points to the importance of {x_1, x_3}. Also, in Figure 1 we represent the posterior distribution of the dimension of the true model for the different values of n. What do we learn?

[Figure 1: Posterior probabilities of the dimension of the true model for the different values of n (only the first 60 dimensions are represented on the x-axis).]

Although the Lasso's results and ours are hardly comparable (one depends on a penalty parameter, the other summarises its results in terms of posterior probabilities, ...), the popularity of the Lasso for variable selection in the n << p setting makes a comparison with our proposal worthwhile. Namely, we have run the Lasso on our dataset using glmnet, the R package by Friedman et al. (2010). Figure 2 shows the Lasso fit for the whole dataset (n = 41 observations). Note that x_4 does not enter the Lasso fit for any value of λ, while x_3 is included in the model for a few values of log-λ around 0 and, after being removed, is included again in models with lower values of log-λ (lower than −1). The value of λ to be used in the Lasso is often chosen by cross-validation.

Two criteria are particularly popular for setting it: the value yielding the minimum cross-validated error (MSE in our case), which we call λ_min, and the one-standard-error rule, i.e. the value of λ yielding the simplest model less than one standard error away from λ_min. We call this last criterion λ_1se; it is intended to be a parsimonious alternative to λ_min. The lower side of Figure 2 shows the estimated values of λ_min and λ_1se. Since these values are chosen by cross-validation, they vary across different runs of the cross-validation; we have therefore plotted, for each of them, a segment covering the central 80% of the values obtained in 100 different cross-validations, together with the median value over all those runs. According to λ_min, x_1, x_2 and x_3 are selected plus 17 to 38 spurious variables, while λ_1se selects models ranging from 1 to 13 variables (the median value of λ_1se would select {x_2, x_3780, x_4494}), so the results derived from λ_min seem more sensible than those derived from λ_1se.

[Figure 2: Lasso applied to the example: as a function of (log-)λ, the penalisation parameter on the l1-norm of the coefficients, the estimated values of the coefficients of the variables included in the model. The upper horizontal axis shows, for some values of (log-)λ, the number of variables included in the model. Black lines correspond to the coefficients of the true explanatory variables (x_1 to x_4), while gray lines stand for the spurious variables included in the model for the different values of λ.]
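For reference, the Lasso comparison can be reproduced along the following lines with glmnet; here X and y stand for the n = 41 by p = 8408 gene-expression design matrix and the simulated response of Section 6, which are not included here.

```r
library(glmnet)
fit <- glmnet(X, y)                                # whole Lasso path over a grid of lambda values
plot(fit, xvar = "lambda", label = TRUE)           # coefficient profiles, as in Figure 2

cv <- cv.glmnet(X, y)                              # 10-fold cross-validation of the MSE
c(lambda.min = cv$lambda.min, lambda.1se = cv$lambda.1se)
sel.min <- which(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)   # variables chosen by lambda_min
sel.1se <- which(as.numeric(coef(cv, s = "lambda.1se"))[-1] != 0)   # variables chosen by the 1-se rule
length(sel.min); length(sel.1se)
# repeating cv.glmnet() over many random fold assignments reproduces the variability
# of lambda_min and lambda_1se summarised by the segments in Figure 2
```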

References

M. Baragatti and D. Pommeret. A study of variable selection using g-prior distribution with ridge parameter. Computational Statistics and Data Analysis, 56, 2012.
Maria J. Bayarri, James O. Berger, Anabel Forte, and Gonzalo García-Donato. Criteria for Bayesian model choice with application to variable selection. Annals of Statistics, 40, 2012.
James O. Berger and Luis R. Pericchi. Objective Bayesian methods for model selection: introduction and comparison. Lecture Notes-Monograph Series, 38, 2001.
Carmen Fernández, Eduardo Ley, and Mark F. Steel. Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100, 2001.
Jerome H. Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 2010.
Gonzalo Garcia-Donato and Anabel Forte. BayesVarSel: Bayesian Variable Selection in Linear Models. R package version 1.0, 2012.
Gonzalo Garcia-Donato and Miguel A. Martinez-Beneito. On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association, 108(501), 2013.
Edward I. George and Robert E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7, 1997.
Mayetri Gupta and Joseph G. Ibrahim. Variable selection in regression mixture modeling for the discovery of gene regulatory networks. Journal of the American Statistical Association, 102, 2007.
Chris Hans. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing, 20, 2010.
Chris Hans, Adrian Dobra, and Mike West. Shotgun stochastic search for regression with many candidate predictors. Journal of the American Statistical Association, 102, 2007.
Harold Jeffreys. Theory of Probability. Oxford University Press, 3rd edition, 1961.

Iain Johnstone and Michael Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A, 367, 2009.
Feng Liang, Rui Paulo, German Molina, Merlise A. Clyde, and James O. Berger. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 2008.
Anastasia Lykou and Ioannis Ntzoufras. On Bayesian lasso variable selection and the specification of the shrinkage parameter. Statistics and Computing, 23, 2013.
Yuzo Maruyama and Edward I. George. gBF: a fully Bayes factor with a generalized g-prior. arXiv preprint, v2, 2012.
Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103, 2008.
James G. Scott and James O. Berger. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7), 2006.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 1996.
Mike West. Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics 7. Oxford University Press, 2002.
Arnold Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In A. Zellner, editor, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Edward Elgar Publishing Limited, 1986.
Arnold Zellner and A. Siow. Basic Issues in Econometrics. Chicago: University of Chicago Press, 1984.


More information

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models D. Fouskakis, I. Ntzoufras and D. Draper December 1, 01 Summary: In the context of the expected-posterior prior (EPP) approach

More information

Variable Selection in Data Mining Project

Variable Selection in Data Mining Project Variable Selection Variable Selection in Data Mining Project Gilles Godbout IFT 6266 - Algorithmes d Apprentissage Session Project Dept. Informatique et Recherche Opérationnelle Université de Montréal

More information

Finite Population Estimators in Stochastic Search Variable Selection

Finite Population Estimators in Stochastic Search Variable Selection 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (2011), xx, x, pp. 1 8 C 2007 Biometrika Trust Printed

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and Athens Journal of Sciences December 2014 Discriminant Analysis with High Dimensional von Mises - Fisher Distributions By Mario Romanazzi This paper extends previous work in discriminant analysis with von

More information

Studentization and Prediction in a Multivariate Normal Setting

Studentization and Prediction in a Multivariate Normal Setting Studentization and Prediction in a Multivariate Normal Setting Morris L. Eaton University of Minnesota School of Statistics 33 Ford Hall 4 Church Street S.E. Minneapolis, MN 55455 USA eaton@stat.umn.edu

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Bayesian Model Comparison

Bayesian Model Comparison BS2 Statistical Inference, Lecture 11, Hilary Term 2009 February 26, 2009 Basic result An accurate approximation Asymptotic posterior distribution An integral of form I = b a e λg(y) h(y) dy where h(y)

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Sean Borman and Robert L. Stevenson Department of Electrical Engineering, University of Notre Dame Notre Dame,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001

Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001 Discussion of Brock and Durlauf s Economic Growth and Reality by Xavier Sala-i-Martin, Columbia University and UPF March 2001 This is a nice paper that summarizes some of the recent research on Bayesian

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES

THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES Submitted to the Annals of Statistics THE MEDIAN PROBABILITY MODEL AND CORRELATED VARIABLES By Maria M. Barbieri, James O. Berger Edward I. George and, Veronika Ročková,, 17 August 2018 Università Roma

More information