
1 Aki Vehtari, Aalto University, Finland
Probabilistic machine learning group, Aalto University
Bayesian theory and methods, approximative integration, model assessment and selection, Gaussian processes, digital health, and disease risk prediction

2 Outline for this talk
Joint work with Juho Piironen
- Large p, small n (generalized) linear regression
- Priors for large p, small n (generalized) linear regression
- Bayesian predictive model selection and reduction

3 Generalized linear models
Gaussian linear model for $y \in \mathbb{R}$:
$$y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n$$
Logistic regression (binary classification) for $y \in \{0, 1\}$:
$$P(y_i = 1) = \frac{1}{1 + \exp(-\beta^T x_i)}$$
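As a concrete sketch, both models can be fit with rstanarm (used throughout the talk); the data frame df and the covariates x1, x2 are hypothetical names:

library(rstanarm)
# Gaussian linear model: y_i = beta^T x_i + eps_i, eps_i ~ N(0, sigma^2)
fit_lin <- stan_glm(y ~ x1 + x2, family = gaussian(), data = df)
# Logistic regression: P(y_i = 1) = 1 / (1 + exp(-beta^T x_i))
fit_log <- stan_glm(y ~ x1 + x2, family = binomial(link = "logit"), data = df)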

4 Large p, small n regression
Linear or generalized linear regression with
- number of covariates p ≫ number of observations n
Large p, small n is common e.g. in modern medical/bioinformatics studies (e.g. microarrays, GWAS) and in brain imaging.
In our examples p is around 10^2 to 10^5, and usually n < 100.

5 Large p, small n regression
- If observations are noiseless, we can fit a uniquely identified regression in n - 1 dimensions
- If observations are noisy, it is more complicated
- If covariates are correlated, it is more complicated

6 Large p, small n regression
Priors!
- Non-sparse priors assume most covariates are relevant but possibly strongly correlated → factor models
- Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n

7 Example
[Figure: marginal posterior intervals for (Intercept), x1-x20, and sigma under a Gaussian prior vs. a horseshoe prior; rstanarm + bayesplot]
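A minimal sketch of how such interval plots can be produced; the data frame df and the prior scales are hypothetical:

library(rstanarm)
library(bayesplot)
fitg  <- stan_glm(y ~ ., data = df, family = gaussian(),
                  prior = normal())   # Gaussian prior on the coefficients
fiths <- stan_glm(y ~ ., data = df, family = gaussian(),
                  prior = hs())       # horseshoe prior
mcmc_intervals(as.matrix(fitg))       # marginal posterior intervals
mcmc_intervals(as.matrix(fiths))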

8 Example
Gaussian vs. horseshoe predictive performance using cross-validation (loo package):
> compare(loog, loohs)
elpd_diff  se
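The loo objects compared above can be computed as follows (a sketch; in newer versions of the loo package, compare() has been replaced by loo_compare()):

library(loo)
loog  <- loo(fitg)    # PSIS-LOO for the Gaussian-prior model
loohs <- loo(fiths)   # PSIS-LOO for the horseshoe model
compare(loog, loohs)  # elpd_diff and its standard error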

9 Large p, small n regression
Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n:
- Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse
- spike-and-slab prior (with point mass at zero): prior on the number of non-zero covariates, discrete
- horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage, continuous

10 Continuous vs. discrete prior
- The spike-and-slab prior (with point mass at zero) mixes a continuous prior with a probability mass at zero; the parameter space is a mixture of continuous and discrete
- Hierarchical shrinkage and horseshoe priors are continuous and assume a varying amount of shrinkage; Markov chain Monte Carlo inference can benefit from the geometry (Hamiltonian Monte Carlo)

11 Horseshoe prior
Linear regression model with covariates $x = (x_1, \dots, x_D)$:
$$y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n$$
The horseshoe prior:
$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \quad \lambda_j \sim C^+(0, 1), \quad j = 1, \dots, D.$$
- The global parameter τ shrinks all β_j towards zero
- The local parameters λ_j allow some β_j to escape the shrinkage
[Figure: horseshoe prior density p(β₁)]

12 Horseshoe prior
Given the hyperparameters, the posterior mean satisfies approximately
$$\bar{\beta}_j = (1 - \kappa_j)\, \beta_j^{\mathrm{ML}}, \quad \kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2\lambda_j^2},$$
where κ_j is the shrinkage factor. With $\lambda_j \sim C^+(0, 1)$, the prior for κ_j is horseshoe-shaped, with most of its mass near κ = 0 (no shrinkage) and κ = 1 (complete shrinkage).
[Figure: prior density of κ_j for nσ⁻²τ² = 1]
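As an illustrative sketch, the horseshoe shape of p(κ_j) is easy to verify by simulation, here assuming nσ⁻²τ² = 1:

lambda <- abs(rcauchy(1e5))             # lambda_j ~ C+(0, 1)
kappa  <- 1 / (1 + lambda^2)            # shrinkage factor when n * sigma^-2 * tau^2 = 1
hist(kappa, breaks = 50, freq = FALSE)  # mass piles up near 0 and 1: the "horseshoe"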

13 The global shrinkage parameter τ
Effective number of nonzero coefficients:
$$m_{\mathrm{eff}} = \sum_{j=1}^{D} (1 - \kappa_j)$$
The prior mean can be shown to be
$$E[m_{\mathrm{eff}} \mid \tau, \sigma] = \frac{\tau\sigma^{-1}\sqrt{n}}{1 + \tau\sigma^{-1}\sqrt{n}}\, D$$
Setting $E[m_{\mathrm{eff}} \mid \tau, \sigma] = p_0$ (the prior guess for the number of nonzero coefficients) yields
$$\tau_0 = \frac{p_0}{D - p_0}\, \frac{\sigma}{\sqrt{n}}$$
Prior guess for τ based on our beliefs about the sparsity.

14 Illustration: p(τ) vs. p(m_eff)
Let n = 100, σ = 1, p₀ = 5, and $\tau_0 = \frac{p_0}{D - p_0}\frac{\sigma}{\sqrt{n}}$, with D the dimensionality.
[Figure: p(m_eff) for two values of D with different choices of p(τ): τ = τ₀, τ ~ N⁺(0, τ₀²), τ ~ C⁺(0, τ₀²), and τ ~ C⁺(0, 1)]

15 Non-Gaussian observation models
The reference value:
$$\tau_0 = \frac{p_0}{D - p_0}\, \frac{\sigma}{\sqrt{n}}$$
The framework can also be applied to non-Gaussian observation models by deriving an appropriate plug-in value for σ from a Gaussian approximation to the likelihood, e.g. σ = 2 for logistic regression.

16 Horseshoe in rstanarm
Easy in rstanarm:
p0 <- 5
tau0 <- p0 / (D - p0) * 1 / sqrt(n)
prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau0)
fit <- stan_glm(y ~ x, gaussian(), prior = prior_coeff, adapt_delta = 0.999)

17 Experiments
Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.
Dataset             Type            D     n
Ovarian             Classification  1536  54
Colon               Classification  2000  62
Prostate            Classification  5966  102
ALLAML              Classification  7129  72
Corn (4 targets)    Regression      700   80

18 Effect of p(τ) on parameter estimates
Ovarian cancer data (n = 54, D = 1536); τ₀ chosen according to a prior guess p₀ = 3.
[Figure: prior and posterior samples for τ and m_eff, and absolute posterior mean coefficients |β_j|, for τ ~ N⁺(0, τ₀²), τ ~ C⁺(0, τ₀²), and τ ~ C⁺(0, 1)]

19 Effect of p(τ) on prediction accuracy (1/2)
For various prior guesses p₀ transformed into τ₀, with τ ~ N⁺(0, τ₀²) (red) and τ ~ C⁺(0, τ₀²) (yellow):
[Figure: for Ovarian, Colon, Prostate, and ALLAML, posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time (min) as functions of p₀]

20 Effect of p(τ) on prediction accuracy (2/2)
For various prior guesses p₀ transformed into τ₀, with τ ~ N⁺(0, τ₀²) (red) and τ ~ C⁺(0, τ₀²) (yellow):
[Figure: for Corn-Moisture, Corn-Oil, Corn-Protein, and Corn-Starch, posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time (min) as functions of p₀]

21 Summary of hyperprior choice for horseshoe prior
- The global shrinkage parameter τ effectively determines the level of sparsity
- The prior p(τ) can have a significant effect on the inference results
- Our framework allows the modeller to calibrate the prior for τ based on prior beliefs about the sparsity
- The concept of the effective number of nonzero regression coefficients m_eff could also be applied to other shrinkage priors
Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54.

22 Why model selection?
Assume a model rich enough to capture many uncertainties; Bayesian theory says to integrate over all uncertainties:
- with model criticism and predictive assessment done, if we are happy with the model there is no need for model selection
- Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features
Box: "All models are wrong, but some are useful"; there are known unknowns and unknown unknowns.
Model selection:
- what if some smaller (or more sparse) or parametric model is practically as good?
- which uncertainties can be ignored? (e.g. Student-t vs. Gaussian, irrelevant covariates)
- reduced measurement cost, simpler to explain (e.g. fewer biomarkers, easier to explain to doctors)

23 Marginal posterior probabilities and intervals
Marginal posterior probabilities and intervals have problems when there are posterior dependencies, e.g. correlation of covariates, or weak identifiability of length scale and magnitude.
[Figure: marginal posterior of a biomarker coefficient β under horseshoe, Laplace, and Gaussian priors; from Peltola, Havulinna, Salomaa, and Vehtari (2014)]

24 Marginal posterior intervals
[Figure: marginal posterior intervals for x1 and x2 vs. their joint posterior density; rstanarm + bayesplot]

25 Bayes factor and marginal likelihood
The marginal likelihood in a Bayes factor can be factored using the chain rule:
$$p(y \mid M_k) = p(y_1 \mid M_k)\, p(y_2 \mid y_1, M_k) \cdots p(y_n \mid y_1, \dots, y_{n-1}, M_k)$$
- Sensitive to the first terms, and not defined if the prior is improper
- sensitive to the prior choices; especially problematic for models with a large difference in the number of parameters
- problematic if none of the models is close to the true model

26 Predictive model selection
- The goodness of a model is evaluated by its predictive performance
- Select a simpler model whose predictive performance is similar to that of the rich model

27 Bayesian predictive methods
Ways to approximate the predictive performance of the posterior predictive distribution:
- Bayesian cross-validation
- Watanabe-Akaike information criterion (WAIC)
- reference predictive methods
Many other Bayesian predictive methods estimate something else, e.g., DIC, L-criterion, posterior predictive criterion.
See the decision-theoretic review of these and more methods in Vehtari, A., and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142-228.
For fast Pareto smoothed importance sampling leave-one-out cross-validation, see Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413-1432. doi:10.1007/s11222-016-9696-4.

28 Selection induced bias in variable selection
- Even if the model performance estimate is unbiased (like LOO-CV), using it for model selection introduces additional fitting to the data
- The performance of the selection process itself can be assessed using two-level cross-validation, but that does not help in choosing better models
- The problem is bigger when there is a large number of models, as in covariate selection
Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y.

29 Selection induced bias in variable selection
[Figure: selection-induced bias in a simulated example for increasing sample sizes n]

30 Selection induced bias in variable selection
[Figure: selection-induced bias for three sample sizes n, comparing CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

31 Selection induced bias in variable selection
[Figure: selection-induced bias for three sample sizes n, comparing CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

32 Selection induced bias in variable selection
[Figure: results on the Ionosphere, Sonar, Ovarian, and Colon datasets, comparing CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

33 Marginal posterior probabilities
- Marginal posterior probabilities are related to the Bayes factor
- Selection by marginal posterior probabilities works better than by LOO-CV, but the submodels still ignore the uncertainty in the removed parts, which leads to overfitting in the selection process

34 Integrate over everything
Bayesian theory says we should integrate over all uncertainties, e.g. integrate over different variable combinations (BMA) or use sparsifying hierarchical shrinkage priors.
[Figure: marginal posterior intervals for (Intercept), x1-x20, and sigma]
If we select a reduced model, we are ignoring some uncertainties. How can we avoid ignoring them?

35 Projective predictive covariate selection, idea
- The full model posterior distribution is our best representation of the uncertainties
- What is the best distribution given the constraint that only the selected covariates have nonzero coefficients?
- Optimal projection from the full posterior to a sparse posterior with minimal predictive loss

36 Projective predictive covariate selection, idea
The full model predictive distribution represents our best knowledge about a future ỹ:
$$p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid D)\, d\theta,$$
where $\theta = (\beta, \sigma^2)$ and β is in general non-sparse (all $\beta_j \neq 0$).
What is the best distribution $q^*(\theta)$ given the constraint that only the selected covariates have nonzero coefficients? Optimization problem:
$$q^* = \arg\min_{q} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid D) \,\middle\|\, \int p(\tilde{y}_i \mid \theta)\, q(\theta)\, d\theta \right)$$
Optimal projection from the full posterior to a sparse posterior (with minimal predictive loss).

37 Projective predictive feature selection, computation
We have posterior draws $\{\theta^s\}_{s=1}^{S}$ for the full model, where $\theta = (\beta, \sigma^2)$ and β is in general non-sparse (all $\beta_j \neq 0$).
The predictive distribution $p(\tilde{y} \mid D) \approx \frac{1}{S} \sum_s p(\tilde{y} \mid \theta^s)$ represents our best knowledge about a future ỹ.
Easier optimization problem by changing the order of integration and optimization (Goutis & Robert, 1998):
$$\theta_\perp^s = \arg\min_{\hat{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta^s) \,\middle\|\, p(\tilde{y}_i \mid \hat{\theta}) \right)$$
The $\theta_\perp^s$ are now (approximate) draws from the projected distribution.

38 Projection by draws
The projection of a single Monte Carlo draw can be solved:
- Gaussian case: analytically
- exponential family case: equivalent to finding the maximum likelihood parameters of the submodel with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003)
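A sketch of the analytic Gaussian-case projection for a single draw, assuming the known min-KL solution for this case; X (n x D), beta_s, sigma2_s, and sel (indices of the selected covariates) are hypothetical names:

project_draw <- function(X, beta_s, sigma2_s, sel) {
  f  <- X %*% beta_s                                  # fit of the reference model
  Xs <- X[, sel, drop = FALSE]
  beta_p <- solve(crossprod(Xs), crossprod(Xs, f))    # least-squares fit of the submodel to f
  sigma2_p <- sigma2_s + mean((f - Xs %*% beta_p)^2)  # projected noise variance
  list(beta = beta_p, sigma2 = sigma2_p)
}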

39 Example
[Figure: marginal posterior intervals for the full model vs. a projected model (with variables ordered by relevance); rstanarm + projpred + bayesplot]
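A sketch of the corresponding workflow with projpred (the exact function arguments have varied across package versions; df is a hypothetical data frame):

library(rstanarm)
library(projpred)
library(bayesplot)
fit <- stan_glm(y ~ ., data = df, family = gaussian(), prior = hs())  # full model
vs  <- cv_varsel(fit)            # variable search + cross-validated assessment
k   <- suggest_size(vs)          # suggested submodel size
prj <- project(vs, nterms = k)   # project the posterior draws onto the submodel
mcmc_intervals(as.matrix(prj))   # intervals of the projected coefficients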

40 Projection: how to choose variables
How to choose which variables to set to zero? It is usually not possible to go through all covariate combinations:
- forward / backward search
- L1-norm search

41 Projection: how to choose variables
How to choose which variables to set to zero? Only k features have nonzero coefficients ($\|\beta\|_0 = k$). Optimization problem:
$$\theta_\perp^s = \arg\min_{\hat{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta^s) \,\middle\|\, p(\tilde{y}_i \mid \hat{\theta}) \right), \quad \text{s.t. } \|\beta\|_0 = k.$$
It is usually not possible to go through all covariate combinations with $\|\beta\|_0 = k$:
- forward / backward search
- L1-norm search

42 Projective feature selection in practice
Finding the optimal projection with k nonzeros ($\|\beta\|_0 = k$) is in practice infeasible. A good heuristic is to replace the L0-constraint with an L1-constraint (as in LASSO). In this case the procedure becomes equivalent to LASSO, but with $y_i$ replaced by $\bar{y}_i = E[y_i \mid D] \approx \frac{1}{S} \sum_s \beta_s^T x_i$ (not obvious, derivation omitted). E.g. for the Gaussian linear model:
$$\beta_\perp = \arg\min_{\beta} \sum_{i=1}^{n} (\bar{y}_i - \beta^T x_i)^2 + \lambda \|\beta\|_1$$
Compare to LASSO:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \|\beta\|_1$$
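A sketch of this heuristic, assuming glmnet for the L1 path; X (n x D) and draws (an S x D matrix of posterior draws of β) are hypothetical names:

library(glmnet)
ybar <- as.vector(X %*% colMeans(draws))  # mean fit of the reference model, E[y_i | D]
path <- glmnet(X, ybar)                   # LASSO path computed on ybar, not on y
# The order in which coefficients enter the path orders the variables;
# the projection itself is then done without the penalty term (next slide).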

43 Projection, more details
- We use the L1-penalization only for ordering the variables
- Once the variables are ordered, we do the projection for the selected model(s) without the penalty term
- After the selection, we do the projection by draws, i.e., each θ^s is projected individually to its corresponding value θ⊥^s with the selected feature combination
- The noise variance σ² is projected via the same min-KL principle (equations omitted)

44 Selection induced bias in variable selection
[Figure: selection-induced bias for three sample sizes n, comparing CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

45 Selection induced bias in variable selection
[Figure: selection-induced bias for three sample sizes n, comparing CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

46 Selection induced bias in variable selection
[Figure: results on the Ionosphere, Sonar, Ovarian, and Colon datasets, comparing CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]

47 Simulated example
n = 80, p = 200, only 7 features are relevant.
[Figure: LASSO path as λ is varied, with the optimal model size chosen by cross-validation (dotted); the vertical axis shows the test MSE, the horizontal axis the number of features]

48 Simulated example
[Figure: LASSO regression coefficients β_j by feature, with λ selected by cross-validation (42 features selected)]

49 Simulated example
[Figure: test MSE vs. number of features; LASSO path (black)]

50 Simulated example
[Figure: test MSE vs. number of features; LASSO path (black), full Bayes model with horseshoe prior (dashed)]

51 Simulated example
[Figure: test MSE vs. number of features; LASSO path (black), full Bayes model with horseshoe prior (dashed), and the L1-projection (blue)]

52 Simulated example
[Figure: LASSO regression coefficients β_j by feature]

53 Simulated example
[Figure: full Bayes regression coefficients β_j by feature (with horseshoe prior)]

54 Simulated example
[Figure: projected regression coefficients β_j by feature]

55 Projection for Gaussian processes
For Gaussian processes, integrate over the latent values and project only the hyperparameters, using the equations given by Piironen, J., and Vehtari, A. (2016). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on.
[Figure: MLPD vs. number of variables for Boston Housing (D = 13), Automobile (D = 38), and Crime (D = 102); full model, ARD, and projection]

56 Projection, pros and cons
Pros:
- fast to compute
- reliable; often performs favorably in comparison to other methods (Piironen and Vehtari, 2017b, Comparison of Bayesian predictive methods for model selection)
Cons:
- requires constructing the full model first
Software: R package projpred, designed to be compatible with rstanarm.
