Model-Averaged l1 Regularization using Markov Chain Monte Carlo Model Composition

Technical Report No. 541, Department of Statistics, University of Washington

Chris Fraley (Insightful Corporation, now at Insilicos LLC, and Department of Statistics, University of Washington) and Daniel Percival (Department of Statistics, Carnegie Mellon University)

August 22, 2008, revised May 14, 2010

Abstract

This paper studies combining l1 regularization and Markov chain Monte Carlo model composition techniques for Bayesian model averaging (BMA). The main idea is to resolve the model uncertainty issues arising from path point selection by treating the l1 regularization path as a model space for BMA. The method is developed for linear and logistic regression, and applied to sample classification in four different biomedical datasets.

1 Introduction

In regression and classification problems, the goal is to predict a response from inputs, or predictors. In many modern data applications, a great number of predictors are available. In this case, dimension reduction is critical for model fitting, and often improves prediction accuracy and interpretability. Variable selection is the process of choosing a few predictors with which to build the regression or classification model. However, most variable selection methods raise the following questions: once a set of variables is chosen, how certain are we that this is the correct set? Could another set do nearly as well? These questions are at the heart of model uncertainty.

Bayesian model averaging (BMA) seeks to address model uncertainty by taking a weighted average over a class of models under consideration (see Hoeting et al. (1999) for an introduction). BMA techniques require a model space over which to average, and Hoeting et al. identify two approaches for determining the model space. The first approach, due to Volinsky et al. (1997), uses the leaps and bounds algorithm (Furnival and Wilson 1974) to obtain a set of candidate models. The second approach uses Markov chain Monte Carlo model composition (MCMCMC), due to Madigan and York (1995), to directly approximate the posterior distribution over the model space.

In this paper, we consider l1 regularization as a method for determining the model space for BMA. Regularization is a popular automatic dimension reduction and variable selection technique. In the case of l1-penalized linear regression, also known as the lasso (Tibshirani 1996), one solves the following optimization problem:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j} |\beta_j|,

in which λ is a nonnegative penalty parameter. For λ = 0 we recover the ordinary least squares (OLS) solution, and for λ sufficiently large we have β̂ = 0. Below this threshold, each particular choice of λ gives a different estimate β̂. This progression of the value of β is the regularization path. As λ grows from zero, the amount of regularization increases. This has the effect of shrinking the entries of β towards zero and (due to the l1 penalty) making some entries exactly zero, giving the lasso its popular sparsity property. A similar effect occurs in other settings, in particular for l1-regularized logistic regression.

For BMA, the leaps and bounds approach was extended by Yeung et al. (2005) to apply to wide datasets, in which there are many more measurements than samples (p > n in statistical terminology), as is common in bioinformatics applications. Fraley and Seligman (2010) replace the models obtained by leaps and bounds with those defined by the l1 regularization path, yielding a method suitable for wide as well as narrow datasets. In this paper we treat the l1 regularization path as a model space for MCMCMC, and develop a combined regularization and model averaging technique with the aim of resolving the model uncertainty issues arising from path point selection.

In this paper, we first review Markov chain Monte Carlo model composition for BMA. We next outline a version of MCMCMC that uses the l1 regularization path as the model space. We finally present examples of applications of this method.

2 Model Averaging using Regularization Paths

2.1 Markov Chain Monte Carlo Model Composition

Suppose we are given data D and a model space M, and wish to compute a quantity Δ. For example, in a regression problem, Δ can be a single entry of β or the probability that a particular entry of β is zero. The posterior distribution of Δ is given by:

P(Δ | D) = \sum_{M \in \mathcal{M}} P(Δ | M, D) P(M | D).

We can estimate this quantity via MCMC by constructing a Markov chain with equilibrium distribution P(M | D), the posterior distribution over the entire model space. This process is called Markov chain Monte Carlo model composition (MCMCMC, or MC^3). Indexing the chain as {M(t)}, t = 1, 2, ..., T, we can then estimate Δ by taking the average

\hat{Δ} = \frac{1}{T} \sum_{t=1}^{T} g(M(t)),

where g(M(t)) calculates the quantity Δ for the model M(t). By the law of large numbers, this estimate converges almost surely to the true value of Δ.

Such a chain with the desired equilibrium property can be constructed via the Metropolis-Hastings (M-H) algorithm. Consider a model N ∈ M. At each step of the M-H algorithm, we must propose a new model N′ ∈ M. This is done via a proposal distribution P(N′ | N). One way to construct such a distribution is to first define a neighborhood around N, and then propose a new model from this neighborhood. For example, in the case of linear regression, the set of models using all possible subsets of a set of candidate covariates can be taken as the model space. Given a particular model, all models with one covariate added or one covariate removed could be considered the neighborhood of that model in this model space.
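To make the neighborhood idea concrete, here is a minimal sketch (our code and names, not the report's) of a symmetric proposal for the all-subsets model space, where a model is an inclusion vector over p candidate covariates and a neighbor differs by exactly one covariate:

```r
## Minimal sketch of a symmetric neighborhood proposal for the all-subsets
## model space: a model is a logical inclusion vector of length p, and a
## proposal toggles one covariate in or out (adds or removes one variable).
propose_neighbor <- function(model) {
  j <- sample(length(model), 1)   # pick one covariate uniformly at random
  model[j] <- !model[j]           # toggle its inclusion
  model
}

set.seed(1)
current <- c(TRUE, FALSE, FALSE, FALSE, FALSE)  # model using only covariate 1
propose_neighbor(current)                       # a neighboring model
```

Because each of the p toggles is equally likely in both directions, this proposal satisfies P(N′ | N) = P(N | N′), so the acceptance probability below reduces to a ratio of posterior model probabilities.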

Once we have constructed our proposal distribution and drawn from it, the proposed model is accepted with probability

P(\text{accept } N') = \min\left( 1, \frac{P(N \mid N')\, P(N' \mid D)}{P(N' \mid N)\, P(N \mid D)} \right).

Note that if the proposal distribution is symmetric, i.e. P(N′ | N) = P(N | N′), then this is simply the ratio of posterior model probabilities. If we accept N′, then N′ becomes the current model N for the next iteration, and a new N′ is proposed conditional on it. Otherwise, the original model N is kept and another N′ is proposed as before. The equilibrium distribution of this chain is then P(M | D).

There are two difficulties in this setup. First, we must construct a sensible proposal distribution. In particular, the concept of a neighborhood in a model space can be hard to define. Later, we will see that this idea becomes quite natural in the context of a regularization path. The second difficulty is calculating the ratio of posterior model probabilities P(N′ | D)/P(N | D). This requires the computation of integrated likelihood values for each model, a notoriously difficult problem.

2.2 Combination Algorithm

We propose to combine MCMCMC and regularization by using the l1 regularization path as the model space M for MCMCMC. When BMA is applied to a model space generated by l1 regularization, MCMCMC has particular appeal, since the metric on this model space is quite natural. The l1 regularization path can be represented by the interval [0, λ_max], which is simply the range of possible λ penalization values; i.e., a point α ∈ [0, λ_max] corresponds to

\hat{\beta}_{\alpha} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j} \beta_j x_{ij} \Big)^2 + \alpha \sum_{j} |\beta_j|.

We can thus write M_α to represent the model from this space at the point λ = α along the l1 regularization path. Under this representation, λ_max represents the intercept-only model and 0 represents the unpenalized regression. A similar setup can be used for the elastic net regularization technique (Zou and Hastie 2005). Since neighborhoods on the real line are intuitive, this model space framework is quite natural. Our combination algorithm then proceeds as follows (a code sketch of these steps appears after the list):

1. Begin by choosing a starting point α ∈ [0, λ_max]. This may be done randomly or in a deterministic way.

2. Propose a new α′ (discussed below).

3. Calculate P(accept M_{α′}) = min(1, P(M_{α′} | D) / P(M_α | D)).

4. Draw a single U(0, 1) variate to determine acceptance. Set α = α′ if accepted; keep α otherwise.

5. Repeat steps 2-4 a large number of times.

6. Discard the first portion of this sequence of α values, and use the remaining tail as an approximation of the model space distribution (this is the distribution of the single parameter α).
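The following is a rough sketch, in R, of steps 1-6 for the linear-regression case. It assumes a predictor matrix x and response y; all function and object names are ours, not the report's. The l1 path is fit once with glmnet, a path point α indexes the model M_α, and log P(M_α | D) is approximated by -BIC/2 under a flat model-space prior (the BIC approximation is discussed below). Note that glmnet's own `alpha` argument is the elastic-net mixing parameter (1 = lasso) and is unrelated to the path point α in the text, which corresponds to glmnet's `s` argument.

```r
## Sketch of the combination algorithm (steps 1-6) for linear regression.
## Assumes a numeric matrix x (n observations by p predictors) and response y.
library(glmnet)

fit        <- glmnet(x, y, alpha = 1)      # lasso path (alpha = 1 is the l1 penalty)
lambda_max <- max(fit$lambda)
n          <- length(y)

## Approximate log P(M_alpha | D) by -BIC/2 at path point lambda = alpha.
log_post <- function(alpha) {
  pred <- predict(fit, newx = x, s = alpha)
  rss  <- sum((y - pred)^2)
  k    <- sum(as.vector(coef(fit, s = alpha)) != 0)  # nonzero coefficients (incl. intercept)
  -0.5 * (n * log(rss / n) + k * log(n))
}

T_iter <- 10000; burn <- 2500
a      <- 0.05 * lambda_max                # proposal width (tuning parameter)
alpha  <- runif(1, 0, lambda_max)          # step 1: random starting point
draws  <- numeric(T_iter)

for (t in 1:T_iter) {
  alpha_new <- rnorm(1, alpha, a)                    # step 2: propose alpha'
  alpha_new <- min(max(alpha_new, 0), lambda_max)    # out-of-range proposals mapped to the boundary
  if (log(runif(1)) < log_post(alpha_new) - log_post(alpha))  # steps 3-4: accept or reject
    alpha <- alpha_new
  draws[t] <- alpha                                  # store the current model
}

alpha_tail <- draws[-(1:burn)]             # step 6: discard burn-in, keep the tail
```

In the examples, the coefficient vector coef(fit, s = alpha) is also stored at each step; this is what the post-processing in Section 2.3 averages over.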

The MCMCMC method described above was implemented for elastic net paths for both normal linear regression and binomial-family generalized linear models. The elastic net models were fit using the R package glmnet (Friedman et al. 2010). Note that l1 regularization is a special case of the elastic net. There are several issues that arise with this algorithm, which we now discuss.

Choosing a proposal distribution. In the second step, the proposal distribution must be adjusted to ensure good properties. This raises three main concerns: (1) the acceptance rate, i.e. how often a proposed model (α′) is accepted; (2) the autocorrelation of the samples (high autocorrelation is not desired); and (3) the potential for the chain to become stuck at a certain point in the middle, or somehow be forced to the ends. Using either a U(α − a, α + a) or a N(α, a) proposal distribution can produce good results with respect to concerns (1) and (2). However, concern (3) is not fully addressed, since this scheme has the potential to become trapped at the ends of the interval representing the regularization path. When the chain is near an end, a significant part of the density of either proposal distribution lies outside of the space. If one of these points is chosen, it is simply mapped to the boundary. This creates an attracting effect.

There are two main ways in which this attracting effect could be resolved. First, a new proposal distribution could be devised, in which the distributions could be truncated to exclude any support outside of the model space. This approach produces the opposite effect: the endpoints now are repelling. Another possibility is to alter the truncation to keep the distributions symmetric around the current model; however, this only serves to enhance the attracting effect of the endpoints. A second approach is to re-parameterize the model space to be constraint free, which can be accomplished by a transformation such as the logit transform. A drawback is that there remain issues with the endpoints, which are distorted to occupy large areas of the real line by the transform, again creating an attracting effect. For now, the best approach to the problem is to choose the tuning parameter a wisely. This choice must take into account both the attraction issue and the number of modes in the posterior distribution over the model space. If the posterior is suspected to be unimodal, then a small tuning parameter is acceptable, and can be chosen small enough to avoid ever reaching one of the endpoints. If the posterior has many modes, then a large tuning parameter can allow the chain to jump between modes easily. Note that the notion of small and large is relative to the value of λ_max.

Approximating Integrated Likelihood. Another major issue in this algorithm is computing the ratio P(M_{α′} | D)/P(M_α | D) for use in step 3. Noting that P(M | D) ∝ P(D | M) P(M), there are two issues in this computation. First, a prior must be placed on the model space. At this time, we use a flat prior, so that all models are considered a priori equally likely. This effectively eliminates the prior terms from all calculations. Second, the integrated likelihood P(D | M) is a notoriously difficult quantity to compute. At this time, we use the BIC approximation. The main way to improve upon the BIC approximation is to apply some sort of Monte Carlo integration technique.
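As an illustration of the BIC approximation (our code and names, not the report's), the following sketch computes BIC at a path point α for a binomial-family glmnet fit, and also shows how BIC values over a grid of path points can be turned into approximate posterior weights, the direct weighted-average alternative mentioned later in the Future Work section. It assumes a predictor matrix x and a 0/1 response y.

```r
## Sketch: BIC approximation to -2 log P(D | M_alpha) for a binomial-family
## path point, computed from predicted probabilities and the number of
## nonzero coefficients. Assumes a numeric matrix x and a 0/1 response y.
library(glmnet)
fit <- glmnet(x, y, family = "binomial")

bic_at <- function(alpha) {
  p   <- predict(fit, newx = x, s = alpha, type = "response")
  p   <- pmin(pmax(p, 1e-12), 1 - 1e-12)                # guard against log(0)
  dev <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))    # binomial deviance
  df  <- sum(as.vector(coef(fit, s = alpha)) != 0)      # nonzero coefficients
  dev + df * log(length(y))
}

## BIC over a grid of path points, converted to approximate posterior model
## weights (exp(-BIC/2), normalized) under a flat prior on the model space.
grid <- seq(min(fit$lambda), max(fit$lambda), length.out = 100)
bic  <- sapply(grid, bic_at)
w    <- exp(-0.5 * (bic - min(bic)))
w    <- w / sum(w)
```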
As noted in Tibshirani (1996), the l1-penalized regression model corresponds to a Bayesian regression model with independent double exponential priors on each coordinate of β, with the posterior mode used as the coefficient estimate. This gives us a basic setup for a Monte Carlo integration. However, in many cases this is a very high dimensional integral. Bayesian approaches to the lasso have been the subject of recent research (Park and Casella 2008, Hans 2009).

Length of Chains and Mixing. As in all Monte Carlo techniques, the length of the chain (the number of times steps 2-4 are repeated in the algorithm) and the convergence of the chain are important concerns. In our examples we found that 10,000 iterations, with the first 2500 discarded, were sufficient. A related concern is autocorrelation in the chain. We suggest regressing the sequence of α values on lagged copies of itself; choosing the lowest lag that gives a small R² will roughly eliminate the autocorrelation.

2.3 Processing MCMCMC Results

Post-processing of an MCMCMC analysis is straightforward. In the algorithm described above, we store at each step the coefficients of the accepted point (either α′ if it was accepted, or α if α′ was rejected). These stored values can be used for all calculations by simple averaging or by taking quantiles. Some items of interest that are easily obtained (a post-processing sketch in code follows this list):

The posterior model space distribution. An estimate of the posterior distribution over the model space can be obtained from the α values stored in the tail, via a density estimator. In the examples in the next section, we use a histogram of the tail α values.

The probability of a coefficient being equal to zero. This can be found by computing the fraction of models in the tail with that coefficient equal to zero.

The value of a coefficient. This can be found by averaging the stored values of that coefficient over the tail.

Fitted values. Fitted values can be produced by using the average coefficient values obtained above. Binomial regression problems can be further investigated by applying a decision boundary to produce classes.

Credible intervals for a coefficient. These can be found by taking quantiles of the coefficient values stored in the tail.
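A minimal post-processing sketch, assuming the chain tail has been stored as alpha_tail (the retained α values, as in the earlier sketch) and coef_tail (a matrix with one row per retained draw and one column per coefficient, intercept first); all names are ours:

```r
## Post-processing of stored draws: model-space histogram, inclusion
## probabilities, averaged coefficients, credible intervals, fitted values.
hist(alpha_tail, breaks = 30)                    # posterior model-space distribution

p_zero    <- colMeans(coef_tail == 0)            # P(coefficient = 0) for each coefficient
post_mean <- colMeans(coef_tail)                 # averaged coefficient values
cred_int  <- apply(coef_tail, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals

fitted_vals <- drop(cbind(1, x) %*% post_mean)   # fitted values from the averaged model
```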

3 Examples

3.1 Example 1: Diabetes Data

The diabetes dataset used in Efron et al. (2004) is used to demonstrate the algorithm applied to a linear regression problem. An l1-penalized linear regression model was fit to the data using glmnet. The usual coefficient plot and cross-validation results are displayed in Figure 1. According to the cross-validation results, the entire last half of the regularization path could contain a good choice for a single point. This displays a degree of model uncertainty that encourages application of our combination technique.

MCMCMC was run for 10,000 iterations, with the first 2500 discarded to allow time for convergence. For analysis, only every 6th draw from this tail was used to reduce autocorrelation. 59% of all moves in the model space were accepted, and of these 86% were moves made with probability one. The MSE for the averaged model was 2884.7, obtained by taking the coordinate-wise mean of the β vector over the tail of the Markov chain. The MSE for an OLS fit of the data is 2933. The process thus gives a marginal improvement.

Figure 1: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross validation of MSE over the l1 regularization path. Bottom left: histogram of the posterior distribution over the model space, P(M | D). Bottom right: plot of BIC over the path.
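For reference, the baseline fits in this example can be reproduced roughly as follows. The report does not state its exact data source, so this sketch assumes the diabetes data distributed with the R package lars (the Efron et al. 2004 dataset, with matrix x and response y); the code and the OLS comparison are ours.

```r
## Sketch of the baseline fits for the diabetes example, assuming the
## diabetes data shipped with the `lars` package (Efron et al. 2004).
library(lars); library(glmnet)
data(diabetes)
x <- diabetes$x; y <- diabetes$y

fit <- glmnet(x, y)        # l1 path; plot(fit) gives the coefficient plot
cv  <- cv.glmnet(x, y)     # cross-validated MSE over the path; plot(cv)

ols <- lm(y ~ x)           # OLS baseline for comparison with the averaged model
mean(residuals(ols)^2)
```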

The remaining examples are all binomial-family generalized linear models.

3.2 Example 2: South African Heart Disease Data

The South African Heart Disease data was used by Park and Hastie (2007) to demonstrate the glmpath algorithm. This dataset consists of nine covariates, with the response being the presence of heart disease, a binary variable. Cross-validation on an l1-regularized fit of the data suggests model uncertainty issues; see Figure 2.

The MCMCMC algorithm was run for 10,000 iterations, with the first 2500 discarded to allow time for convergence. 54% of all proposed moves were accepted, with 99% of these made with probability one. 332 of the 462 cases were predicted correctly using the mean coefficient vector and a decision boundary at 0.5. This is comparable to standard techniques.

Figure 2: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross validation of log-likelihood over the l1 regularization path. Bottom left: histogram of the posterior distribution over the model space, P(M | D). Bottom right: plot of BIC over the path.
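A sketch of the binomial-family pieces used here (our code, not the report's): x is assumed to be the 9-column predictor matrix, y the 0/1 heart-disease indicator, and post_mean the chain-tail average of the coefficients (intercept first) as in the earlier sketches.

```r
## l1-regularized logistic regression path and cross-validation for the
## heart-disease data (assumed matrix x and 0/1 response y).
library(glmnet)
fit <- glmnet(x, y, family = "binomial")
cv  <- cv.glmnet(x, y, family = "binomial")     # cross-validation over the path

## Classification with the averaged coefficients (post_mean, intercept first)
## and a decision boundary at 0.5:
prob <- drop(plogis(cbind(1, x) %*% post_mean))
pred <- as.integer(prob > 0.5)
sum(pred == y)                                  # number of cases predicted correctly
```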

3.3 Example 3: Leukemia Data

The Leukemia data was used by Yeung et al. (2005) to demonstrate the iterative Bayesian model averaging algorithm for wide datasets. We only consider the two-class version of the dataset, which consists of 38 training samples and 34 test samples. There are 3051 covariates, each representing a gene expression measurement. The response is the presence of leukemia. Cross-validation over the l1 regularization path again shows potential model uncertainty issues.

The MCMCMC algorithm was run for 10,000 iterations, with the first 2500 discarded to allow time for convergence. 53% of proposed moves were accepted, with 92% of these moves made with probability one. Every 5th observation of the tail was used to reduce autocorrelation. See Figure 3 for plots of the results.

The averaged model correctly predicted 33 of the 34 cases in the test set. As in Yeung et al. (2005), we also evaluate performance using the Brier score:

\text{Brier Score} := \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2.

The Brier score for this model was 1.385. Yeung et al. achieved a score of 1.5, with 33 of 34 test cases predicted correctly. Alternately, we can drop all covariates with P(β_i ≠ 0) < 0.5 so that our solution is sparse. This strategy leaves 18 genes (compared to 20 in Yeung et al.) plus an intercept. This model predicts 31 of 34 test cases correctly, and has a Brier score of 1.876.

Figure 3: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross validation of log-likelihood over the l1 regularization path. Bottom left: histogram of the posterior distribution over the model space, P(M | D). Bottom right: plot of BIC over the path.
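The Brier score and the sparsified model are straightforward to compute from the stored draws. A small sketch with assumed objects (names are ours): y_test holds the 0/1 test responses, prob_test the test-set probabilities from the averaged model, and coef_tail the retained coefficient draws as before.

```r
## Brier score of the averaged model on the test set, per the definition above.
brier <- mean((y_test - prob_test)^2)

## Sparsified model: keep only covariates whose posterior probability of being
## nonzero (estimated from the chain tail, excluding the intercept) is at least 0.5.
keep <- colMeans(coef_tail[, -1] != 0) >= 0.5
sum(keep)    # number of genes retained
```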

3.4 Example 4: Breast Cancer Data

The breast cancer data was also used by Yeung et al. (2005) to demonstrate iterative BMA. This dataset contains about 10,000 covariates, 78 training cases, and 19 test cases. An l1-regularized fit of the data, with the optimal point chosen by AIC or BIC, gives very poor performance on the test data.

The MCMCMC algorithm was run for 10,000 iterations, with the first 2500 discarded to allow time for convergence. 53% of proposed moves were accepted, with 92% of these moves made with probability one. Every 5th observation of the tail was used to reduce autocorrelation. See Figure 4 for plots of the results.

The averaged model predicts 11 of 19 of the test cases correctly, with a Brier score of 4.79. This suggests a weakness in the technique. The poor performance on this dataset may be due to poor performance of the l1 regularization itself on these data.

Figure 4: Results of the MCMCMC algorithm. Top left: coefficient path plot, produced by glmnet. Top right: cross validation of log-likelihood over the LASSO path. Bottom left: histogram of the posterior distribution over the model space, P(M | D). Bottom right: plot of BIC over the LASSO path.

4 Future Work

There are several areas of potential improvement in this combination algorithm.

First, there are issues with the proposal distribution in the Metropolis-Hastings step. The most significant of these is a tendency to become trapped near the edges of the regularization path. A new proposal distribution, or an automatic way to choose the tuning parameter for the distribution, could solve this issue. Re-parameterization of the model space could also help.

Second, the BIC approximation is fast, but in many cases it is poor. BIC approximations are worst when the number of observations is very small compared to the number of covariates, which is precisely the situation where regularization techniques are most desirable. Therefore, we would prefer that the algorithm not rely so heavily on BIC. Computing the integrated likelihood is a hard problem, so this improvement will probably be the most challenging to make.

Third, the algorithm did not work well compared to other methods on the fourth example dataset. We suspect that this was a result of poor performance of l1 regularization on that dataset. A closer investigation will hopefully reveal the true cause, and will certainly aid in more appropriate application of the algorithm.

Fourth, the question of the model space prior has not been fully addressed. Though the flat prior is easy to use, there are others that may be more appropriate. For example, choosing a prior that prefers sparse models could be of some benefit.

Finally, it is worth noting that a simple direct use of the BIC curve over the regularization path may be as effective as the MCMCMC algorithm. This would be done by simply using BIC to take a weighted average. As in the MCMCMC algorithm, BIC could be replaced by a better approximation of the integrated likelihood.

Acknowledgements

This work was supported by National Institutes of Health SBIR awards 2R44GM074313-02 and 3R44GM074313-04S1.

References

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407-499, 2004.

C. Fraley and M. Seligman. Model-averaged l1 regularization. Technical report, Insilicos LLC, 2010 (in preparation).

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 2010.

G. M. Furnival and R. W. Wilson. Regression by leaps and bounds. Technometrics, 16:499-511, 1974.

C. Hans. Bayesian lasso regression. Biometrika, 96:835-845, 2009.

J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, 14(4):382-417, 1999.

D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215-232, 1995.

M.-Y. Park and T. Hastie. An L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B, 69:659-677, 2007.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681-686, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

C. T. Volinsky, D. Madigan, A. E. Raftery, and R. A. Kronmal. Bayesian model averaging in proportional hazard models: assessing stroke risk. Journal of the Royal Statistical Society, Series C (Applied Statistics), 46:433-448, 1997.

K. Y. Yeung, R. E. Bumgarner, and A. E. Raftery. Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21:2394-2402, 2005.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301-320, 2005.