
Aki Vehtari, Aalto University, Finland
Probabilistic machine learning group, Aalto University: http://research.cs.aalto.fi/pml/
Bayesian theory and methods, approximative integration, model assessment and selection, Gaussian processes, digital health, and disease risk prediction

Outline for this talk (joint work with Juho Piironen)
Large p, small n (generalized) linear regression
Priors for large p, small n (generalized) linear regression
Bayesian predictive model selection and reduction

Generalized linear models
Gaussian linear model for y ∈ R:
  y_i = β^T x_i + ε_i,  ε_i ~ N(0, σ²),  i = 1, ..., n
Logistic regression (binary classification) for y ∈ {0, 1}:
  P(y_i = 1) = 1 / (1 + exp(−β^T x_i))

Large p, small n regression
Linear or generalized linear regression where the number of covariates p is much larger than the number of observations n.
Large p, small n is common e.g. in modern medical/bioinformatics studies (e.g. microarrays, GWAS) and in brain imaging.
In our examples p is around 1e2–1e5, and usually n < 100.
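As a concrete reference for the code sketches below, here is a minimal simulated large p, small n data set in R; the dimensions and the name dat are illustrative choices for this sketch, not from the slides.

  # Illustrative simulated data: more covariates than observations, few relevant
  set.seed(1)
  n <- 50; p <- 100; p_rel <- 5
  X <- matrix(rnorm(n * p), n, p)             # n x p design matrix
  beta <- c(rnorm(p_rel), rep(0, p - p_rel))  # only the first p_rel coefficients nonzero
  y <- as.vector(X %*% beta) + rnorm(n)       # Gaussian noise with sd 1
  dat <- data.frame(y = y, X)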

Large p, small n regression
With noiseless observations we can fit a uniquely identified regression in n − 1 dimensions.
With noisy observations it is more complicated; with correlating covariates it is more complicated.

Large p, small n regression: priors!
Non-sparse priors: assume most covariates are relevant, but possibly with strong correlations (factor models).
Sparse priors: assume only a small number of covariates are effectively non-zero, m_eff ≪ n.

Example
[Figure: marginal posterior intervals for (Intercept), x1–x20 and sigma under a Gaussian prior (left) and a horseshoe prior (right); rstanarm + bayesplot]

Example
Gaussian vs. horseshoe predictive performance using cross-validation (loo package):
  > compare(loog, loohs)
  elpd_diff        se
        7.9       2.8
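A minimal sketch of how such a comparison can be produced with rstanarm and loo, using the simulated dat from earlier; the fit names are illustrative, and newer loo versions provide loo_compare() in place of compare().

  library(rstanarm)
  library(loo)

  # The same regression under a Gaussian prior and under a horseshoe prior
  fit_gauss <- stan_glm(y ~ ., data = dat, family = gaussian(), prior = normal())
  fit_hs    <- stan_glm(y ~ ., data = dat, family = gaussian(), prior = hs(),
                        adapt_delta = 0.999)

  # PSIS-LOO estimates of expected log predictive density, then compare
  loog  <- loo(fit_gauss)
  loohs <- loo(fit_hs)
  compare(loog, loohs)   # positive elpd_diff favours the second model (horseshoe)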

Large p, small n regression
Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n.
Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse.
Spike-and-slab (with a point mass at zero): prior on the number of non-zero covariates, discrete.
Horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage, continuous.

Continuous vs. discrete prior
The spike-and-slab prior (with a point mass at zero) mixes a continuous prior and a probability mass at zero, so the parameter space is a mixture of continuous and discrete.
Hierarchical shrinkage and horseshoe priors are continuous and assume a varying amount of shrinkage; Markov chain Monte Carlo inference can benefit from the geometry (Hamiltonian Monte Carlo).

Horseshoe prior
Linear regression model with covariates x = (x_1, ..., x_D):
  y_i = β^T x_i + ε_i,  ε_i ~ N(0, σ²),  i = 1, ..., n
The horseshoe prior:
  β_j | λ_j, τ ~ N(0, λ_j² τ²),  λ_j ~ C⁺(0, 1),  j = 1, ..., D
The global parameter τ shrinks all β_j towards zero; the local parameters λ_j allow some β_j to escape the shrinkage.
[Figure: prior density p(β_1) and joint prior contours of (β_1, β_2)]

Horseshoe prior
Given the hyperparameters, the posterior mean satisfies approximately
  β̄_j = (1 − κ_j) β_j^ML,  κ_j = 1 / (1 + n σ⁻² τ² λ_j²),
where κ_j is the shrinkage factor. With λ_j ~ C⁺(0, 1) and n σ⁻² τ² = 1, the prior for κ_j is horseshoe-shaped, with most of its mass near 0 (no shrinkage) and 1 (complete shrinkage).
[Figure: prior density of κ_j for n σ⁻² τ² = 1]
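A small simulation sketch of the implied prior on the shrinkage factor κ_j; the values of n, σ and τ are chosen only so that n σ⁻² τ² = 1, as on the slide.

  # Draw local scales lambda_j ~ C+(0, 1) and look at the implied prior on kappa_j
  n <- 100; sigma <- 1; tau <- 1 / sqrt(n)        # chosen so that n * sigma^-2 * tau^2 = 1
  lambda <- abs(rcauchy(1e5, location = 0, scale = 1))
  kappa  <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
  hist(kappa, breaks = 50, main = "Prior on the shrinkage factor",
       xlab = expression(kappa))   # horseshoe shape: most mass near 0 and near 1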

The global shrinkage parameter τ
Effective number of nonzero coefficients:
  m_eff = Σ_{j=1}^D (1 − κ_j)
The prior mean can be shown to be
  E[m_eff | τ, σ] = (τ σ⁻¹ √n) / (1 + τ σ⁻¹ √n) · D
Setting E[m_eff | τ, σ] = p_0 (prior guess for the number of nonzero coefficients) yields
  τ_0 = p_0 / (D − p_0) · σ / √n
so the prior for τ can be calibrated based on our beliefs about the sparsity.
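The reference value τ_0 is simple to compute; a minimal helper sketch (the function name tau0 is just for this sketch):

  # Reference value for the global scale from a prior guess p0 of nonzero coefficients
  tau0 <- function(p0, D, n, sigma = 1) {
    p0 / (D - p0) * sigma / sqrt(n)
  }
  tau0(p0 = 3, D = 1536, n = 54)   # e.g. the Ovarian data dimensions used later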

Illustration: p(τ) vs. p(m_eff)
Let n = 100, σ = 1, p_0 = 5, τ_0 = p_0 / (D − p_0) · σ / √n, and D = dimensionality.
p(m_eff) with different choices of p(τ): τ = τ_0, τ ~ N⁺(0, τ_0²), τ ~ C⁺(0, τ_0²), τ ~ C⁺(0, 1).
[Figure: implied priors p(m_eff) for the four choices of p(τ), for D = 10 and D = 1000]

Non-Gaussian observation models
The reference value: τ_0 = p_0 / (D − p_0) · σ / √n.
The framework can also be applied to non-Gaussian observation models by deriving appropriate plug-in values for σ (Gaussian approximation to the likelihood), e.g. σ = 2 for logistic regression.
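For logistic regression, the plug-in value σ = 2 can be motivated by the pseudo-variance 1/(μ(1 − μ)) evaluated at μ = 1/2; reusing the tau0() helper sketched above:

  # Plug-in sigma for logistic regression: pseudo-variance 1 / (0.5 * 0.5) = 4, so sigma = 2
  sigma_logit <- 2
  tau0(p0 = 3, D = 1536, n = 54, sigma = sigma_logit)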

Horseshoe in rstanarm
Easy in rstanarm:
  p <- 5                              # prior guess for the number of nonzero coefficients
  tau <- p / (d - p) * 1 / sqrt(n)    # reference value, with d covariates and n observations
  prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau)
  fit <- stan_glm(y ~ x, gaussian(), prior = prior_coeff,
                  adapt_delta = 0.999)

Experiments
Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.
Dataset           Type            D      n
Ovarian           Classification  1536   54
Colon             Classification  2000   62
Prostate          Classification  5966   102
ALLAML            Classification  7129   72
Corn (4 targets)  Regression      700    80

Effect of p(τ) on parameter estimates
Priors compared: τ ~ N⁺(0, τ_0²), τ ~ C⁺(0, τ_0²), τ ~ C⁺(0, 1), with τ_0 chosen according to a prior guess p_0 = 3.
Ovarian cancer data (n = 54, D = 1536).
[Figure: prior and posterior samples for τ and m_eff, and absolute posterior mean coefficients |β_j|, for each choice of p(τ)]

Effect of p(τ) on prediction accuracy (1/2)
For various prior guesses p_0 transformed into τ_0, with τ ~ N⁺(0, τ_0²) (red) and τ ~ C⁺(0, τ_0²) (yellow):
[Figure: for the Ovarian, Colon, Prostate and ALLAML datasets, the posterior mean m_eff, the mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and the computation time, as functions of p_0]

Effect of p(τ) on prediction accuracy (2/2)
For various prior guesses p_0 transformed into τ_0, with τ ~ N⁺(0, τ_0²) (red) and τ ~ C⁺(0, τ_0²) (yellow):
[Figure: for the four Corn targets (Moisture, Oil, Protein, Starch), the posterior mean m_eff, the mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and the computation time, as functions of p_0]

Summary of hyperprior choice for the horseshoe prior
The global shrinkage parameter τ effectively determines the level of sparsity.
The prior p(τ) can have a significant effect on the inference results.
Our framework allows the modeller to calibrate the prior for τ based on prior beliefs about the sparsity.
The concept of the effective number of nonzero regression coefficients m_eff could also be applied to other shrinkage priors.
Juho Piironen and Aki Vehtari (2017). On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905-913.

Why model selection?
Assume a model rich enough to capture a lot of the uncertainties. Bayesian theory says to integrate over all uncertainties; with model criticism and predictive assessment done, if we are happy with the model there is no need for model selection. Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features (500–100 000).
Box: all models are wrong, but some are useful; there are known unknowns and unknown unknowns.
Model selection: what if some smaller (or more sparse) or parametric model is practically as good? Which uncertainties can be ignored (e.g. Student-t vs. Gaussian, irrelevant covariates)? Reduced measurement cost, simpler to explain (e.g. fewer biomarkers, and easier to explain to doctors).

Marginal posterior probabilities and intervals
Marginal posterior probabilities and intervals have problems when there are posterior dependencies, e.g. correlation of covariates, or weak identifiability of length scale and magnitude.
[Figure: marginal posteriors of biomarker coefficients β under horseshoe, Laplace and Gaussian priors; from Peltola, Havulinna, Salomaa, and Vehtari (2014)]

Marginal posterior intervals
[Figure: marginal posterior intervals for x1 and x2 (left) and their joint posterior density (right); rstanarm + bayesplot]

Bayes factor and marginal likelihood
The marginal likelihood in the Bayes factor can be factored using the chain rule:
  p(y | M_k) = p(y_1 | M_k) p(y_2 | y_1, M_k) ··· p(y_n | y_1, ..., y_{n−1}, M_k)
It is sensitive to the first terms, and not defined if the prior is improper. Hence it is sensitive to the prior choices, especially problematic for models with a large difference in the number of parameters, and problematic if none of the models is close to the true model.

Predictive model selection
The goodness of a model is evaluated by its predictive performance. Select a simpler model whose predictive performance is similar to that of the rich model.

Bayesian predictive methods
Ways to approximate the predictive performance of using the posterior predictive distribution: Bayesian cross-validation, the Watanabe-Akaike information criterion (WAIC), and reference predictive methods. Many other Bayesian predictive methods estimate something else, e.g. DIC, the L-criterion, and the posterior predictive criterion.
See a decision theoretic review and more methods in Vehtari, A., and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142-228.
For fast Pareto smoothed importance sampling leave-one-out cross-validation, see Aki Vehtari, Andrew Gelman and Jonah Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413-1432. doi:10.1007/s11222-016-9696-4. arXiv preprint arXiv:1507.04544.

Selection induced bias in variable selection
Even if the model performance estimate is unbiased (like LOO-CV), using it for model selection introduces additional fitting to the data. The performance of the selection process itself can be assessed using two-level cross-validation, but this does not help in choosing better models. The problem is bigger if there is a large number of models, as in covariate selection.
Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y. arXiv preprint arXiv:1503.08650.

Selection induced bias in variable selection
[Figure: illustration of selection induced bias in a simulated variable selection example, for three training set sizes n]

Selection induced bias in variable selection
[Figure: for three training set sizes n, predictive performance as a function of the number of selected variables, comparing CV-10, WAIC, DIC, MPP, BMA-ref and BMA-proj; Piironen & Vehtari (2017)]

Selection induced bias in variable selection
[Figure: the same comparison as on the previous slide]

Selection induced bias in variable selection
[Figure: predictive performance as a function of the number of selected variables on the Ionosphere, Sonar, Ovarian and Colon datasets, comparing CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref and BMA-proj; Piironen & Vehtari (2017)]

Marginal posterior probabilities
Marginal posterior probabilities are related to the Bayes factor. They work better than LOO-CV for selection, but the submodels still ignore the uncertainty in the removed parts, which leads to overfitting in the selection process.

Integrate over everything
Bayes theory says we should integrate over all uncertainties, e.g. integrate over different variable combinations (BMA) or use sparsifying hierarchical shrinkage priors.
[Figure: marginal posterior intervals for (Intercept), x1–x20 and sigma under a sparsifying shrinkage prior]
If we select a reduced model we are ignoring some uncertainties. How can we avoid ignoring these uncertainties?

Projective predictive covariate selection, idea
The full model posterior distribution is our best representation of the uncertainties. What is the best distribution given the constraint that only the selected covariates have nonzero coefficients? The optimal projection from the full posterior to a sparse posterior with minimal predictive loss.

Projective predictive covariate selection, idea
The full model predictive distribution represents our best knowledge about a future ỹ:
  p(ỹ | D) = ∫ p(ỹ | θ) p(θ | D) dθ,
where θ = (β, σ²) and β is in general non-sparse (all β_j nonzero).
What is the best distribution q⊥(θ) given the constraint that only the selected covariates have nonzero coefficients? Optimization problem:
  q⊥ = arg min_q (1/n) Σ_{i=1}^n KL( p(ỹ_i | D) || ∫ p(ỹ_i | θ) q(θ) dθ )
This is the optimal projection from the full posterior to a sparse posterior (with minimal predictive loss).

Projective predictive feature selection, computation
We have posterior draws {θ_s}, s = 1, ..., S, from the full model, where θ = (β, σ²) and β is in general non-sparse (all β_j nonzero).
The predictive distribution p(ỹ | D) ≈ (1/S) Σ_s p(ỹ | θ_s) represents our best knowledge about a future ỹ.
An easier optimization problem is obtained by changing the order of integration and optimization (Goutis & Robert, 1998):
  θ⊥_s = arg min_θ̂ (1/n) Σ_{i=1}^n KL( p(ỹ_i | θ_s) || p(ỹ_i | θ̂) )
The θ⊥_s are now (approximate) draws from the projected distribution.

Projection by draws
The projection of one Monte Carlo draw can be solved analytically in the Gaussian case. In the exponential family case it is equivalent to finding the maximum likelihood parameters for the submodel with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003).
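For the Gaussian linear model the per-draw projection has a closed form: the projected coefficients are the least-squares fit of the submodel to the full model's fit for that draw, and the projected noise variance absorbs the lost fit. A minimal sketch under these assumptions; all names are illustrative and this is not the projpred implementation.

  # Project one posterior draw (beta_s, sigma2_s) of the full Gaussian linear model
  # onto the submodel using only the columns in Xsub (X is the full n x D matrix)
  project_draw <- function(beta_s, sigma2_s, X, Xsub) {
    mu <- as.vector(X %*% beta_s)                              # full-model fit for this draw
    beta_proj <- solve(crossprod(Xsub), crossprod(Xsub, mu))   # least-squares fit to mu
    sigma2_proj <- sigma2_s + mean((mu - as.vector(Xsub %*% beta_proj))^2)
    list(beta = beta_proj, sigma2 = sigma2_proj)
  }
  # e.g. projecting draw s onto covariates 1 and 14 (hypothetical draw objects):
  # project_draw(beta_draws[s, ], sigma2_draws[s], X, X[, c(1, 14)])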

Example
[Figure: marginal posterior intervals of the coefficients in the full model (left) and in a projected model with the variables ordered by relevance (right); rstanarm + projpred + bayesplot]

Projection: how to choose variables
How to choose which variables to set to zero? It is usually not possible to go through all covariate combinations, so we use a forward/backward search or an L1-norm based search.

Projection: how to choose variables
How to choose which variables to set to zero? Suppose only k features have a nonzero coefficient (||β||_0 = k). Optimization problem:
  θ⊥_s = arg min_θ̂ (1/n) Σ_{i=1}^n KL( p(ỹ_i | θ_s) || p(ỹ_i | θ̂) ),  s.t. ||β̂||_0 = k
It is usually not possible to go through all covariate combinations with ||β||_0 = k, so we use a forward/backward search or the L1-norm.

Projective feature selection in practice
Finding the optimal projection with k nonzeros (||β||_0 = k) is in practice infeasible. A good heuristic is to replace the L0-constraint with an L1-constraint (as in LASSO). In this case the procedure becomes equivalent to LASSO, but with y_i replaced by ȳ_i = E[y_i | D] ≈ (1/S) Σ_s β_s^T x_i (not obvious, derivation omitted). E.g. for the Gaussian linear model
  β̃ = arg min_β Σ_{i=1}^n (ȳ_i − β^T x_i)² + λ ||β||_1
Compare to LASSO:
  β̂ = arg min_β Σ_{i=1}^n (y_i − β^T x_i)² + λ ||β||_1
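A sketch of this heuristic with glmnet, assuming beta_draws is an S × D matrix of posterior draws of β from the full model (the names beta_draws and ybar are illustrative):

  library(glmnet)

  # Lasso path computed on the full-model mean fit ybar instead of the data y
  ybar <- as.vector(X %*% colMeans(beta_draws))
  path <- glmnet(X, ybar)
  # The order in which coefficients become nonzero along the path gives the
  # variable ordering used for the projection; the penalty is not used beyond that.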

Projection, more details
We use the L1-penalization only for ordering the variables. Once the variables are ordered, we do the projection for the selected model(s) without the penalty term. After the selection, we do the projection by draws, i.e., each θ_s is projected individually to its corresponding value θ⊥_s with the selected feature combination. The noise variance σ² is projected via the same minimum-KL principle (equations omitted).

Selection induced bias in variable selection
[Figure: the simulated comparison shown earlier, repeated after introducing the projection; BMA-proj corresponds to the projective approach; Piironen & Vehtari (2017)]

Selection induced bias in variable selection
[Figure: the same comparison as on the previous slide]

Selection induced bias in variable selection
[Figure: the real-data comparison (Ionosphere, Sonar, Ovarian, Colon) shown earlier, repeated; Piironen & Vehtari (2017)]

Simulated example
n = 80, p = 200, only 7 features are relevant.
[Figure: test MSE as a function of the number of features along the lasso path as λ is varied; the optimal model size chosen by cross-validation is marked with a dotted line; the vertical axis shows the test error]

Simulated example
[Figure: LASSO regression coefficients β_j for the 200 features, when λ is selected by cross-validation (42 features selected)]

Simulated example
[Figure: test MSE as a function of the number of features: lasso path (black)]

Simulated example
[Figure: test MSE as a function of the number of features: lasso path (black) and full model Bayes with the horseshoe prior (dashed)]

Simulated example
[Figure: test MSE as a function of the number of features: lasso path (black), full model Bayes with the horseshoe prior (dashed), and the L1-projection (blue)]

Simulated example
[Figure: LASSO regression coefficients β_j]

Simulated example
[Figure: full Bayes regression coefficients β_j (with horseshoe prior)]

Simulated example
[Figure: projected regression coefficients β_j]

Projection for Gaussian processes
For Gaussian processes, integrate over the latent values and project only the hyperparameters, using the equations given by Piironen, J., and Vehtari, A. (2016). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on. doi:10.1109/MLSP.2016.7738829. arXiv preprint arXiv:1510.04813.
[Figure: MLPD as a function of the number of selected variables for the full model, ARD and the projection, on the Boston Housing (D = 13), Automobile (D = 38) and Crime (D = 102) datasets]

Projection, pros and cons
Pros: fast to compute; reliable, often performs favorably in comparison to other methods (Piironen and Vehtari, 2017b)¹.
Cons: requires constructing the full model first.
Software: R package projpred, designed to be compatible with rstanarm: https://github.com/stan-dev/projpred, https://users.aalto.fi/~jtpiiron/projpred/quickstart.html
¹ Comparison of Bayesian predictive methods for model selection, http://link.springer.com/article/10.1007/s11222-016-9649-y
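A hedged sketch of the corresponding projpred workflow on top of an rstanarm fit, using the simulated dat from earlier; function and argument names have changed between projpred versions, so treat this as the general shape of the workflow rather than an exact API reference.

  library(rstanarm)
  library(projpred)

  # Full model with a sparsifying horseshoe prior (set global_scale as earlier if desired)
  fit <- stan_glm(y ~ ., data = dat, family = gaussian(),
                  prior = hs(), adapt_delta = 0.999)

  vs <- varsel(fit)                  # search for the variable ordering and project along it
  plot(vs)                           # predictive performance vs. number of projected covariates
  nsel <- suggest_size(vs)           # heuristic suggestion for the submodel size
  prj <- project(vs, nterms = nsel)  # projected posterior for the chosen submodel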