Aki Vehtari, Aalto University, Finland
Probabilistic Machine Learning group, Aalto University
http://research.cs.aalto.fi/pml/
Bayesian theory and methods, approximate integration, model assessment and selection, Gaussian processes, digital health, and disease risk prediction
Outline for this talk
Joint work with Juho Piironen
- Large p, small n (generalized) linear regression
- Priors for large p, small n (generalized) linear regression
- Bayesian predictive model selection and reduction
Generalized linear models
Gaussian linear model for $y \in \mathbb{R}$:
$$y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n$$
Logistic regression (binary classification) for $y \in \{0, 1\}$:
$$P(y_i = 1) = \frac{1}{1 + \exp(-\beta^T x_i)}$$
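Both models above are easy to simulate; a minimal NumPy sketch (not from the slides; the sizes n, p and the coefficient values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.0])
sigma = 1.0

# Gaussian linear model: y_i = beta^T x_i + eps_i, eps_i ~ N(0, sigma^2)
y_gauss = X @ beta + rng.normal(0.0, sigma, size=n)

# Logistic regression: P(y_i = 1) = 1 / (1 + exp(-beta^T x_i))
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y_binary = rng.binomial(1, prob)
```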
Large p, small n regression
- Linear or generalized linear regression with the number of covariates p much larger than the number of observations n
- Large p, small n is common e.g. in modern medical/bioinformatics studies (microarrays, GWAS) and brain imaging
- In our examples p is around 1e2 to 1e5, and usually n < 100
Large p, small n regression
- With noiseless observations we can fit a uniquely identified regression in n - 1 dimensions
- With noisy observations, more complicated
- With correlating covariates, more complicated
Large p, small n regression: priors!
- Non-sparse priors: assume most covariates are relevant but possibly strongly correlated → factor models
- Sparse priors: assume only a small number of covariates are effectively non-zero, $m_{\mathrm{eff}} \ll n$
Example
[Figure: posterior intervals for the coefficients (Intercept), x1-x20, and sigma under a Gaussian prior and under a horseshoe prior; rstanarm + bayesplot]
Example
Gaussian vs. horseshoe predictive performance using cross-validation (loo package):

> compare(loog, loohs)
elpd_diff        se
      7.9       2.8
Large p, small n regression
Sparse priors assume only a small number of covariates are effectively non-zero, $m_{\mathrm{eff}} \ll n$:
- Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse
- Spike-and-slab (with a point mass at zero): prior on the number of non-zero covariates; discrete
- Horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage; continuous
Continuous vs. discrete prior
- Spike-and-slab prior (with a point mass at zero) mixes a continuous prior with a probability mass at zero → the parameter space is a mixture of continuous and discrete
- Hierarchical shrinkage and horseshoe priors are continuous → assume a varying amount of shrinkage
- Markov chain Monte Carlo inference can benefit from the continuous geometry → Hamiltonian Monte Carlo
Horseshoe prior
Linear regression model with covariates $x = (x_1, \ldots, x_D)$:
$$y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n$$
The horseshoe prior:
$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \quad \lambda_j \sim C^+(0, 1), \quad j = 1, \ldots, D$$
- The global parameter $\tau$ shrinks all $\beta_j$ towards zero
- The local parameters $\lambda_j$ allow some $\beta_j$ to escape the shrinkage
[Figure: marginal prior density $p(\beta_1)$ and joint prior contours of $(\beta_1, \beta_2)$]
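Drawing coefficients from this prior takes only a few lines; a NumPy sketch (illustrative, not from the slides; D and tau are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
D, tau = 1000, 0.1   # illustrative values

# lambda_j ~ C+(0, 1): a half-Cauchy draw is the absolute value of a Cauchy draw
lam = np.abs(rng.standard_cauchy(D))

# beta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2)
beta = rng.normal(0.0, lam * tau)
```

Most draws end up near zero, while the heavy-tailed $\lambda_j$ let a few coefficients escape the global shrinkage.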
Horseshoe prior
Given the hyperparameters, the posterior mean satisfies approximately
$$\bar{\beta}_j = (1 - \kappa_j)\,\hat{\beta}_j^{\mathrm{ML}}, \quad \kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2\lambda_j^2},$$
where $\kappa_j$ is the shrinkage factor.
With $\lambda_j \sim C^+(0, 1)$, the prior for $\kappa_j$ is horseshoe-shaped (shown here for $n\sigma^{-2}\tau^2 = 1$).
[Figure: prior density of the shrinkage factor $\kappa$ on $[0, 1]$]
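The shrinkage factor formula above can be checked numerically; a small sketch (illustrative values for $\tau$, $\sigma$, $n$):

```python
import numpy as np

def shrinkage_factor(lam, tau, sigma, n):
    """kappa_j = 1 / (1 + n * sigma^-2 * tau^2 * lambda_j^2)."""
    return 1.0 / (1.0 + n * sigma**-2 * tau**2 * lam**2)

lam = np.array([0.01, 1.0, 100.0])
kappa = shrinkage_factor(lam, tau=0.1, sigma=1.0, n=100)
# small lambda_j -> kappa_j near 1 (coefficient shrunk towards zero);
# large lambda_j -> kappa_j near 0 (coefficient escapes the shrinkage)
```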
The global shrinkage parameter τ
Effective number of nonzero coefficients:
$$m_{\mathrm{eff}} = \sum_{j=1}^{D} (1 - \kappa_j)$$
The prior mean can be shown to be
$$\mathrm{E}[m_{\mathrm{eff}} \mid \tau, \sigma] = \frac{\tau\sigma^{-1}\sqrt{n}}{1 + \tau\sigma^{-1}\sqrt{n}}\, D$$
Setting $\mathrm{E}[m_{\mathrm{eff}} \mid \tau, \sigma] = p_0$ (prior guess for the number of nonzero coefficients) yields for τ
$$\tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}}$$
→ a prior guess for τ based on our beliefs about the sparsity
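These two formulas are consistent by construction: plugging $\tau_0$ back into the prior mean recovers the guess $p_0$. A NumPy sketch (the sizes D = 1536, n = 54 are taken from the ovarian example later in the talk):

```python
import numpy as np

def tau0(p0, D, sigma, n):
    """Reference value tau0 = p0 / (D - p0) * sigma / sqrt(n)."""
    return p0 / (D - p0) * sigma / np.sqrt(n)

def expected_meff(tau, sigma, n, D):
    """E[m_eff | tau, sigma] = (tau sigma^-1 sqrt(n)) / (1 + tau sigma^-1 sqrt(n)) * D."""
    a = tau / sigma * np.sqrt(n)
    return a / (1.0 + a) * D

t = tau0(p0=3, D=1536, sigma=1.0, n=54)
expected_meff(t, sigma=1.0, n=54, D=1536)  # -> 3.0 up to floating point
```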
Illustration: p(τ) vs. p(m_eff)
Let $n = 100$, $\sigma = 1$, $p_0 = 5$, $\tau_0 = \frac{p_0}{D - p_0}\frac{\sigma}{\sqrt{n}}$, D = dimensionality.
p(m_eff) with different choices of p(τ): $\tau = \tau_0$, $\tau \sim N^+(0, \tau_0^2)$, $\tau \sim C^+(0, \tau_0^2)$, $\tau \sim C^+(0, 1)$
[Figure: prior distributions of m_eff for a small and a large D under each choice of p(τ)]
Non-Gaussian observation models
- The reference value: $\tau_0 = \frac{p_0}{D - p_0}\frac{\sigma}{\sqrt{n}}$
- The framework can be applied also to non-Gaussian observation models by deriving appropriate plug-in values for σ → Gaussian approximation to the likelihood
- E.g. σ = 2 for logistic regression
Horseshoe in rstanarm
Easy in rstanarm:

p0 <- 5
tau0 <- p0 / (D - p0) * 1 / sqrt(n)
prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau0)
fit <- stan_glm(y ~ x, gaussian(), prior = prior_coeff,
                adapt_delta = 0.999)
Experiments
Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.

Dataset            Type            D     n
Ovarian            Classification  1536  54
Colon              Classification  2000  62
Prostate           Classification  5966  102
ALLAML             Classification  7129  72
Corn (4 targets)   Regression      700   80
Effect of p(τ) on parameter estimates
$\tau \sim N^+(0, \tau_0^2)$, $\tau \sim C^+(0, \tau_0^2)$, $\tau \sim C^+(0, 1)$
Ovarian cancer data (n = 54, D = 1536); $\tau_0$ chosen according to a prior guess $p_0 = 3$.
[Figure: prior and posterior samples for τ and m_eff, and absolute posterior mean coefficients $|\bar{\beta}_j|$, under each of the three hyperpriors]
Effect of p(τ) on prediction accuracy (1/2)
For various $p_0$ transformed into $\tau_0$, with $\tau \sim N^+(0, \tau_0^2)$ (red) and $\tau \sim C^+(0, \tau_0^2)$ (yellow):
[Figure: for Ovarian, Colon, Prostate and ALLAML: posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time, as functions of $p_0$]
Effect of p(τ) on prediction accuracy (2/2)
For various $p_0$ transformed into $\tau_0$, with $\tau \sim N^+(0, \tau_0^2)$ (red) and $\tau \sim C^+(0, \tau_0^2)$ (yellow):
[Figure: for Corn-Moisture, Corn-Oil, Corn-Protein and Corn-Starch: posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time, as functions of $p_0$]
Summary of hyperprior choice for the horseshoe prior
- The global shrinkage parameter τ effectively determines the level of sparsity
- The prior p(τ) can have a significant effect on the inference results
- Our framework allows the modeller to calibrate the prior for τ based on prior beliefs about the sparsity
- The concept of the effective number of nonzero regression coefficients m_eff could be applied also to other shrinkage priors

Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905-913.
Why model selection?
- Assume a model rich enough to capture lots of uncertainties
  - Bayesian theory says to integrate over all uncertainties
  - model criticism and predictive assessment
  - if we are happy with the model, no need for model selection
  - Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features (500 to 100,000)
- Box: "All models are wrong, but some are useful"
  - there are known unknowns and unknown unknowns
- Model selection:
  - what if some smaller (or sparser) or parametric model is practically as good?
  - which uncertainties can be ignored? (e.g. Student-t vs. Gaussian, irrelevant covariates)
  - reduced measurement cost, simpler to explain (e.g. fewer biomarkers, easier to explain to doctors)
Marginal posterior probabilities and intervals
- problems when there are posterior dependencies, e.g. correlation of covariates, weak identifiability of length scale and magnitude
[Figure: marginal posterior intervals of β over the biomarkers under horseshoe, Laplace and Gaussian priors; from Peltola, Havulinna, Salomaa, and Vehtari (2014)]
Marginal posterior intervals
[Figure: marginal posterior intervals for x1 and x2 vs. the joint posterior density of (x1, x2); rstanarm + bayesplot]
Bayes factor and marginal likelihood
The marginal likelihood in the Bayes factor can be factored using the chain rule:
$$p(y \mid M_k) = p(y_1 \mid M_k)\, p(y_2 \mid y_1, M_k) \cdots p(y_n \mid y_1, \ldots, y_{n-1}, M_k)$$
- Sensitive to the first terms, and not defined if the prior is improper
- sensitive to the prior choices
- especially problematic for models with a large difference in the number of parameters
- problematic if none of the models is close to the true model
Predictive model selection
- The goodness of a model is evaluated by its predictive performance
- Select a simpler model whose predictive performance is similar to that of the rich model
Bayesian predictive methods
Ways to approximate the predictive performance of the posterior predictive distribution:
- Bayesian cross-validation
- Watanabe-Akaike information criterion (WAIC)
- reference predictive methods
Many other Bayesian predictive methods estimate something else, e.g. DIC, L-criterion, posterior predictive criterion.
See the decision-theoretic review and more methods in Vehtari, A., and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142-228.
For fast Pareto smoothed importance sampling leave-one-out cross-validation, see Aki Vehtari, Andrew Gelman and Jonah Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413-1432. doi:10.1007/s11222-016-9696-4. arXiv preprint arXiv:1507.04544.
Selection induced bias in variable selection
- Even if the model performance estimate is unbiased (like LOO-CV), using it for model selection introduces additional fitting to the data
- The performance of the selection process itself can be assessed using two-level cross-validation, but this does not help in choosing better models
- The problem is bigger when there is a large number of models, as in covariate selection

Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y. arXiv preprint arXiv:1503.08650.
Selection induced bias in variable selection
[Figure: selection induced bias illustrated for three dataset sizes n; the estimated performance of the selected model is optimistic relative to the true performance, with the bias decreasing as n grows]
Selection induced bias in variable selection
[Figure: for increasing dataset sizes n and for the selection methods CV-10, WAIC, DIC, MPP, BMA-ref and BMA-proj: estimated vs. true predictive performance as a function of the number of selected variables; Piironen & Vehtari (2017)]
Selection induced bias in variable selection
[Figure: test performance vs. number of selected variables on the Ionosphere, Sonar, Ovarian and Colon datasets for CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref and BMA-proj; Piironen & Vehtari (2017)]
Marginal posterior probabilities
- Marginal posterior probabilities are related to the Bayes factor
- works better than LOO-CV, but the submodels still ignore the uncertainty in the removed parts, which leads to overfitting in the selection process
Integrate over everything
- Bayesian theory says we should integrate over all uncertainties
  - e.g. integrate over different variable combinations (BMA) or use sparsifying hierarchical shrinkage priors
[Figure: posterior intervals for (Intercept), x1-x20, and sigma under the horseshoe prior]
- If we select a reduced model, we are ignoring some uncertainties
- How do we avoid ignoring these uncertainties?
Projective predictive covariate selection: idea
- The full model posterior distribution is our best representation of the uncertainties
- What is the best distribution given the constraint that only the selected covariates have nonzero coefficients?
- Optimal projection from the full posterior to a sparse posterior with minimal predictive loss
Projective predictive covariate selection: idea
The full model predictive distribution represents our best knowledge about a future $\tilde{y}$:
$$p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid D)\, d\theta,$$
where $\theta = (\beta, \sigma^2)$ and β is in general non-sparse (all $\beta_j$ nonzero).
What is the best distribution $q(\theta)$ given the constraint that only the selected covariates have nonzero coefficients? Optimization problem:
$$q^* = \arg\min_{q} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid D) \,\middle\|\, \int p(\tilde{y}_i \mid \theta)\, q(\theta)\, d\theta \right)$$
→ optimal projection from the full posterior to a sparse posterior (with minimal predictive loss)
Projective predictive feature selection: computation
We have posterior draws $\{\theta_s\}_{s=1}^{S}$ for the full model, where $\theta = (\beta, \sigma^2)$ and β is in general non-sparse (all $\beta_j$ nonzero).
The predictive distribution $p(\tilde{y} \mid D) \approx \frac{1}{S} \sum_s p(\tilde{y} \mid \theta_s)$ represents our best knowledge about a future $\tilde{y}$.
An easier optimization problem is obtained by changing the order of integration and optimization (Goutis & Robert, 1998):
$$\theta_s^{\perp} = \arg\min_{\hat{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta_s) \,\middle\|\, p(\tilde{y}_i \mid \hat{\theta}) \right)$$
The $\theta_s^{\perp}$ are now (approximate) draws from the projected distribution.
Projection by draws
The projection of one Monte Carlo draw can be solved:
- Gaussian case: analytically
- exponential family case: equivalent to finding the maximum likelihood parameters for the submodel, with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003)
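A sketch of the analytic Gaussian case (the slide omits the equations; this follows the standard form in the projection predictive literature, and the function name is illustrative): the submodel coefficients are the least-squares fit to the reference model's fit $f = X\beta$, and the projected noise variance absorbs the discrepancy.

```python
import numpy as np

def project_draw(beta, sigma2, X, idx):
    """Project one posterior draw (beta, sigma2) of a Gaussian linear
    model onto the submodel using only the covariates in idx."""
    f = X @ beta                       # fit of the reference model
    Xs = X[:, idx]
    beta_sub, *_ = np.linalg.lstsq(Xs, f, rcond=None)
    # the projected noise variance absorbs the part of the fit left out
    sigma2_sub = sigma2 + np.mean((f - Xs @ beta_sub) ** 2)
    return beta_sub, sigma2_sub

# sanity check: projecting onto all covariates changes nothing
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
beta_full = np.array([1.0, -2.0, 0.5, 0.0])
b_sub, s2_sub = project_draw(beta_full, 1.0, X, [0, 1, 2, 3])
```

Applying this to every draw $\theta_s$ gives the (approximate) draws $\theta_s^{\perp}$ from the projected distribution.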
Example
[Figure: posterior intervals for the full model (variables x1-x20) and for a projected model with the variables ordered by relevance; rstanarm + projpred + bayesplot]
Projection: how to choose variables
- How to choose which variables to set to zero?
- Usually not possible to go through all covariate combinations
  - forward / backward search
  - L1-norm search
Projection: how to choose variables
How to choose which variables to set to zero? Only k features have nonzero coefficients ($\|\beta\|_0 = k$). Optimization problem:
$$\theta_s^{\perp} = \arg\min_{\hat{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta_s) \,\middle\|\, p(\tilde{y}_i \mid \hat{\theta}) \right), \quad \text{s.t. } \|\beta\|_0 = k$$
Usually not possible to go through all covariate combinations with $\|\beta\|_0 = k$:
- forward / backward search
- L1-norm
Projective feature selection in practice
- Finding the optimal projection with k nonzeros ($\|\beta\|_0 = k$) is in practice infeasible
- A good heuristic is to replace the $L_0$-constraint with an $L_1$-constraint (as in LASSO)
- In this case the procedure becomes equivalent to LASSO, but with $y_i$ replaced by $\bar{y}_i = \mathrm{E}[y_i \mid D] \approx \frac{1}{S}\sum_s \beta_s^T x_i$ (not obvious; derivation omitted)
E.g. for the Gaussian linear model:
$$\beta^{\perp} = \arg\min_{\beta} \sum_{i=1}^{n} (\bar{y}_i - \beta^T x_i)^2 + \lambda \|\beta\|_1$$
Compare to LASSO:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \|\beta\|_1$$
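A rough sketch of this heuristic (illustrative code, not the projpred implementation): the same LASSO solver is simply handed the reference model's mean fit $\bar{y}$ instead of the raw observations $y$. The solver below is plain textbook coordinate descent.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for sum_i (y_i - beta^T x_i)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            # soft-thresholding at lam/2 because the squared loss is not halved
            beta[j] = np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / col_sq[j]
    return beta

# projection heuristic: run the same LASSO, but on the reference fit
# y_bar = X @ beta_draws.mean(axis=0)   # y_bar_i ~= (1/S) sum_s beta_s^T x_i
# beta_proj = lasso_cd(X, y_bar, lam)
```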
Projection: more details
- We use the $L_1$-penalization only for ordering the variables
- Once the variables are ordered, we do the projection for the selected model(s) without the penalty term
- After the selection, we do the projection by draws, i.e., each $\theta_s$ is projected individually to its corresponding value $\theta_s^{\perp}$ with the selected feature combination
- The noise variance $\sigma^2$ is projected via the same min-KL principle (equations omitted)
Selection induced bias in variable selection
[Figure (repeated from earlier): for increasing dataset sizes n and for the selection methods CV-10, WAIC, DIC, MPP, BMA-ref and BMA-proj: estimated vs. true predictive performance as a function of the number of selected variables; Piironen & Vehtari (2017)]
Selection induced bias in variable selection
[Figure (repeated from earlier): test performance vs. number of selected variables on the Ionosphere, Sonar, Ovarian and Colon datasets for CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref and BMA-proj; Piironen & Vehtari (2017)]
Simulated example
n = 80, p = 200, only 7 features are relevant
[Figure: test MSE vs. number of features along the Lasso path as λ is varied; optimal model size by cross-validation shown dotted]
Simulated example
[Figure: LASSO regression coefficients $\beta_j$ over the features, when λ is selected by cross-validation (42 features selected)]
Simulated example
[Figure, built up in three steps: test MSE vs. number of features for the Lasso path (black), the full-model Bayes with horseshoe prior (dashed), and the $L_1$-projection (blue)]
Simulated example
[Figures: regression coefficients $\beta_j$ over the features for LASSO, for full Bayes with the horseshoe prior, and for the projection]
Projection for Gaussian processes
For Gaussian processes: integrate over the latent values and project only the hyperparameters, using the equations given by
Piironen, J., and Vehtari, A. (2016). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on. doi:10.1109/MLSP.2016.7738829. arXiv preprint arXiv:1510.04813.
[Figure: MLPD vs. number of variables for Boston Housing (D = 13), Automobile (D = 38) and Crime: full model, ARD, and projection]
Projection: pros and cons
Pros:
- Fast to compute
- Reliable; often performs favorably in comparison to other methods (Piironen and Vehtari, 2017b)¹
Cons:
- Requires constructing the full model first
Software: R package projpred, designed to be compatible with rstanarm: https://github.com/stan-dev/projpred, https://users.aalto.fi/~jtpiiron/projpred/quickstart.html
¹ Comparison of Bayesian predictive methods for model selection, http://link.springer.com/article/10.1007/s11222-016-9649-y