1 Aki Vehtari, Aalto University, Finland Probabilistic machine learning group, Aalto University Bayesian theory and methods, approximative integration, model assessment and selection, Gaussian processes, digital health, and disease risk prediction
2 Outline for this talk. Joint work with Juho Piironen. Large p, small n (generalized) linear regression. Priors for large p, small n (generalized) linear regression. Bayesian predictive model selection and reduction.
3 Generalized linear models. Gaussian linear model for y ∈ R: y_i = β^T x_i + ε_i, ε_i ~ N(0, σ²), i = 1, ..., n. Logistic regression (binary classification) for y ∈ {0, 1}: P(y_i = 1) = 1 / (1 + exp(−β^T x_i)).
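As a concrete sketch, both models can be fit with rstanarm (dat, y, and the covariates are hypothetical names):

    library(rstanarm)
    # Gaussian linear model: y_i = beta^T x_i + eps_i, eps_i ~ N(0, sigma^2)
    fit_lin <- stan_glm(y ~ ., data = dat, family = gaussian())
    # Logistic regression: P(y_i = 1) = 1/(1 + exp(-beta^T x_i))
    fit_log <- stan_glm(y ~ ., data = dat, family = binomial(link = "logit"))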
4 Large p, small n regression. Linear or generalized linear regression with the number of covariates p much larger than the number of observations n. Large p, small n is common e.g. in modern medical/bioinformatics studies (e.g. microarrays, GWAS) and in brain imaging. In our examples p is around 10²-10⁵, and usually n < 100.
5 Large p, small n regression. With noiseless observations we can fit a uniquely identified regression in n − 1 dimensions. With noisy observations it is more complicated. With correlating covariates it is more complicated.
6 Large p, small n regression. Priors! Non-sparse priors assume most covariates are relevant but possibly strongly correlated (factor models). Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n.
7 Example. [Figure: marginal posterior intervals for the coefficients (Intercept), x1-x20, and sigma under a Gaussian prior and under a horseshoe prior; made with rstanarm + bayesplot.]
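A minimal sketch of how such a figure can be produced (assuming rstanarm and bayesplot; dat is a hypothetical data frame with response y and covariates x1-x20):

    library(rstanarm)
    library(bayesplot)
    fit_g <- stan_glm(y ~ ., data = dat, family = gaussian(),
                      prior = normal(0, 2.5))             # Gaussian prior
    fit_hs <- stan_glm(y ~ ., data = dat, family = gaussian(),
                       prior = hs(), adapt_delta = 0.999) # horseshoe prior
    mcmc_intervals(as.matrix(fit_g))   # wide intervals for all coefficients
    mcmc_intervals(as.matrix(fit_hs))  # most coefficients shrunk towards zero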
8 Example. Gaussian vs. horseshoe predictive performance, compared using cross-validation (loo package):

    > compare(loog, loohs)
    elpd_diff se
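A sketch of that comparison using the fits above (note that compare() from the loo package, as shown on the slide, has been renamed loo_compare() in newer versions):

    library(loo)
    loog <- loo(fit_g)    # PSIS-LOO for the Gaussian-prior model
    loohs <- loo(fit_hs)  # PSIS-LOO for the horseshoe model
    compare(loog, loohs)  # positive elpd_diff favours the second model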
9 Large p, small n regression. Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n. Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse. Spike-and-slab prior (with a point mass at zero): prior on the number of non-zero covariates, discrete. Horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage, continuous.
10 Continuous vs. discrete prior. The spike-and-slab prior (with a point mass at zero) mixes a continuous prior with a probability mass at zero, so the parameter space is a mixture of continuous and discrete. Hierarchical shrinkage and horseshoe priors are continuous and assume a varying amount of shrinkage; Markov chain Monte Carlo inference, in particular Hamiltonian Monte Carlo, can benefit from the continuous geometry.
11 Horseshoe prior. Linear regression model with covariates x = (x_1, ..., x_D): y_i = β^T x_i + ε_i, ε_i ~ N(0, σ²), i = 1, ..., n. The horseshoe prior: β_j | λ_j, τ ~ N(0, λ_j² τ²), λ_j ~ C⁺(0, 1), j = 1, ..., D. The global parameter τ shrinks all β_j towards zero, while the local parameters λ_j allow some β_j to escape the shrinkage. [Figure: marginal prior density p(β_1).]
12 Horseshoe prior. Given the hyperparameters, the posterior mean satisfies approximately β̄_j = (1 − κ_j) β_j^ML, where κ_j = 1 / (1 + n σ⁻² τ² λ_j²) is the shrinkage factor. With λ_j ~ C⁺(0, 1), the prior for κ_j looks like: [Figure: prior density of κ_j, U-shaped ("horseshoe") on (0, 1) when n σ⁻² τ² = 1.]
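The horseshoe shape of p(κ_j) is easy to verify by simulation; a minimal sketch with n, σ, τ chosen so that n σ⁻² τ² = 1:

    lambda <- abs(rcauchy(1e5))                         # lambda_j ~ C+(0, 1)
    n <- 100; sigma <- 1; tau <- 1 / sqrt(n)            # so n * sigma^-2 * tau^2 = 1
    kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)  # shrinkage factors
    hist(kappa, breaks = 50)  # U-shaped on (0, 1): most mass near 0 and 1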
13 The global shrinkage parameter τ. Effective number of nonzero coefficients: m_eff = Σ_{j=1}^D (1 − κ_j). The prior mean can be shown to be E[m_eff | τ, σ] = (τ σ⁻¹ √n) / (1 + τ σ⁻¹ √n) · D. Setting E[m_eff | τ, σ] = p₀ (the prior guess for the number of nonzero coefficients) yields τ₀ = p₀ / (D − p₀) · σ / √n, i.e. a prior guess for τ based on our beliefs about the sparsity.
14 Illustration: p(τ) vs. p(m_eff). Let n = 100, σ = 1, p₀ = 5, τ₀ = p₀ / (D − p₀) · σ / √n, D = dimensionality. [Figure: p(m_eff) for two dimensionalities D with different choices of p(τ): τ = τ₀ (fixed), τ ~ N⁺(0, τ₀²), τ ~ C⁺(0, τ₀²), and τ ~ C⁺(0, 1).]
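The implied prior p(m_eff) can be simulated for any choice of p(τ); a sketch for τ ~ C⁺(0, τ₀²):

    D <- 1000; n <- 100; sigma <- 1; p0 <- 5
    tau0 <- p0 / (D - p0) * sigma / sqrt(n)  # reference value for tau
    meff <- replicate(2000, {
      tau <- abs(rcauchy(1, scale = tau0))   # tau ~ C+(0, tau0^2)
      lambda <- abs(rcauchy(D))              # lambda_j ~ C+(0, 1)
      kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
      sum(1 - kappa)                         # m_eff for this prior draw
    })
    hist(meff, breaks = 50)  # prior mass concentrates near the guess p0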
15 Non-Gaussian observation models. The reference value: τ₀ = p₀ / (D − p₀) · σ / √n. The framework can be applied also to non-Gaussian observation models by deriving appropriate plug-in values for σ from a Gaussian approximation to the likelihood, e.g. σ = 2 for logistic regression.
16 Horseshoe in rstanarm. Easy in rstanarm:

    p0 <- 5                              # prior guess for the number of relevant covariates
    tau0 <- p0 / (D - p0) * 1 / sqrt(n)
    prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau0)
    fit <- stan_glm(y ~ x, gaussian(), prior = prior_coeff, adapt_delta = 0.999)
17 Experiments. Table: summary of the real-world datasets; D denotes the number of predictors and n the dataset size.

    Dataset            Type            D     n
    Ovarian            Classification  1536  54
    Colon              Classification  2000  62
    Prostate           Classification  5966  102
    ALLAML             Classification  7129  72
    Corn (4 targets)   Regression      700   80
18 Effect of p(τ) on parameter estimates. Ovarian cancer data (n = 54, D = 1536), with τ₀ chosen according to a prior guess p₀ = 3. [Figure: prior and posterior samples for τ and m_eff, and absolute posterior mean coefficients |β_j|, under τ ~ N⁺(0, τ₀²), τ ~ C⁺(0, τ₀²), and τ ~ C⁺(0, 1).]
19 Effect of p(τ) on prediction accuracy (1/2). [Figure: for Ovarian, Colon, Prostate, and ALLAML, the posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time, for various prior guesses p₀ transformed into τ₀, with τ ~ N⁺(0, τ₀²) (red) and τ ~ C⁺(0, τ₀²) (yellow).]
20 Effect of p(τ) on prediction accuracy (2/2). [Figure: for Corn-Moisture, Corn-Oil, Corn-Protein, and Corn-Starch, the posterior mean m_eff, mean log predictive density (MLPD) on test data (dashed line denotes LASSO), and computation time, for various prior guesses p₀ transformed into τ₀, with τ ~ N⁺(0, τ₀²) (red) and τ ~ C⁺(0, τ₀²) (yellow).]
21 Summary of hyperprior choice for the horseshoe prior. The global shrinkage parameter τ effectively determines the level of sparsity. The prior p(τ) can have a significant effect on the inference results. Our framework allows the modeller to calibrate the prior for τ based on prior beliefs about the sparsity. The concept of the effective number of nonzero regression coefficients m_eff could be applied also to other shrinkage priors. Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905-913.
22 Why model selection? Assume a model rich enough to capture a lot of uncertainties; Bayesian theory says to integrate over all uncertainties. With model criticism and predictive assessment done, if we are happy with the model there is no need for model selection. Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features. Box: all models are wrong, but some are useful; there are known unknowns and unknown unknowns. Model selection asks: what if some smaller (or more sparse) or parametric model is practically as good? Which uncertainties can be ignored (e.g. Student-t vs. Gaussian, irrelevant covariates)? Benefits include reduced measurement cost and simpler explanations (e.g. fewer biomarkers, easier to explain to doctors).
23 Marginal posterior probabilities and intervals. Marginal posterior probabilities and intervals have problems when there are posterior dependencies, e.g. correlation of covariates or weak identifiability of length scale and magnitude. [Figure: marginal posterior of a biomarker coefficient β under horseshoe, Laplace, and Gaussian priors; from Peltola, Havulinna, Salomaa, and Vehtari (2014).]
24 Marginal posterior intervals. [Figure: marginal posterior intervals for x1 and x2 alongside the joint posterior density; rstanarm + bayesplot.]
25 Bayes factor and marginal likelihood. The marginal likelihood in the Bayes factor can be factored using the chain rule: p(y | M_k) = p(y_1 | M_k) p(y_2 | y_1, M_k) ··· p(y_n | y_1, ..., y_{n−1}, M_k). It is sensitive to the first terms and not defined if the prior is improper, hence sensitive to the prior choices; it is especially problematic for models with a large difference in the number of parameters, and problematic if none of the models is close to the true model.
26 Predictive model selection. The goodness of a model is evaluated by its predictive performance. Select a simpler model whose predictive performance is similar to that of the rich model.
27 Bayesian predictive methods. Ways to approximate the predictive performance using the posterior predictive distribution: Bayesian cross-validation, the Watanabe-Akaike information criterion (WAIC), and reference predictive methods. Many other Bayesian predictive methods estimate something else, e.g. DIC, the L-criterion, and the posterior predictive criterion. See the decision-theoretic review and more methods in Vehtari, A., and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142-228. For fast Pareto smoothed importance sampling leave-one-out cross-validation, see Aki Vehtari, Andrew Gelman and Jonah Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413-1432. doi:10.1007/s11222-016-9696-4. arXiv preprint arXiv:1507.04544.
28 Selection induced bias in variable selection. Even if the model performance estimate is unbiased (like LOO-CV), using it for model selection introduces additional fitting to the data. The performance of the selection process itself can be assessed using two-level cross-validation, but that does not help in choosing better models. The problem grows when there is a large number of models, as in covariate selection. Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y. arXiv preprint arXiv:1503.08650.
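The effect is easy to demonstrate even outside the Bayesian machinery; a sketch with pure-noise data, plain lm(), and exact leave-one-out residuals:

    set.seed(1)
    n <- 30; p <- 100
    x <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)  # pure noise: no covariate is truly relevant
    # exact LOO mean squared error for a one-covariate linear model
    loo_mse <- function(xj, y) {
      f <- lm(y ~ xj)
      mean((residuals(f) / (1 - hatvalues(f)))^2)
    }
    scores <- apply(x, 2, loo_mse, y = y)
    min(scores)   # the selected "best" covariate looks good ...
    mean(scores)  # ... relative to the average, purely due to selection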
29 Selection induced bias in variable selection. [Figure: illustration of selection induced bias for increasing training set size n.]
30 Selection induced bias in variable selection. [Figure: selection induced bias for increasing training set size n; methods: CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
31 Selection induced bias in variable selection. [Figure: selection induced bias for increasing training set size n; methods: CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
32 Selection induced bias in variable selection. [Figure: results on the real datasets Ionosphere, Sonar, Ovarian, and Colon; methods: CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
33 Marginal posterior probabilities. Marginal posterior probabilities are related to the Bayes factor. They work better than LOO-CV, but the submodels still ignore uncertainty in the removed parts, which leads to overfitting in the selection process.
34 Integrate over everything. Bayesian theory says we should integrate over all uncertainties, e.g. integrate over different variable combinations (BMA) or use sparsifying hierarchical shrinkage priors. [Figure: marginal posterior intervals for (Intercept), x1-x20, and sigma.] If we select a reduced model we are ignoring some uncertainties. How can we avoid ignoring these uncertainties?
35 Projective predictive covariate selection, idea. The full model posterior distribution is our best representation of the uncertainties. What is the best distribution given the constraint that only selected covariates have nonzero coefficients? An optimal projection from the full posterior to a sparse posterior with minimal predictive loss.
36 Projective predictive covariate selection, idea. The full model predictive distribution represents our best knowledge about a future ỹ: p(ỹ | D) = ∫ p(ỹ | θ) p(θ | D) dθ, where θ = (β, σ²) and β is in general non-sparse (all β_j ≠ 0). What is the best distribution q*(θ) given the constraint that only selected covariates have nonzero coefficients? Optimization problem: q* = arg min_q (1/n) Σ_{i=1}^n KL( p(ỹ_i | D) ‖ ∫ p(ỹ_i | θ) q(θ) dθ ). This is an optimal projection from the full posterior to a sparse posterior (with minimal predictive loss).
37 Projective predictive feature selection, computation. We have posterior draws {θ_s}_{s=1}^S for the full model, where θ = (β, σ²) and β is in general non-sparse (all β_j ≠ 0). The predictive distribution p(ỹ | D) ≈ (1/S) Σ_s p(ỹ | θ_s) represents our best knowledge about a future ỹ. An easier optimization problem is obtained by changing the order of integration and optimization (Goutis & Robert, 1998): θ_s⊥ = arg min_θ̂ (1/n) Σ_{i=1}^n KL( p(ỹ_i | θ_s) ‖ p(ỹ_i | θ̂) ). The θ_s⊥ are now (approximate) draws from the projected distribution.
38 Projection by draws. The projection of one Monte Carlo draw can be solved: in the Gaussian case, analytically; in the exponential family case, it is equivalent to finding the maximum likelihood parameters for the submodel with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003).
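A sketch of the analytic Gaussian-case projection for a single draw (X is the full design matrix, Xs its selected columns, beta_s and sigma2_s one posterior draw; hypothetical names, with the submodel fit obtained by least squares against the full model's fit):

    project_draw <- function(X, Xs, beta_s, sigma2_s) {
      f <- X %*% beta_s                             # full model fit for this draw
      bp <- solve(crossprod(Xs), crossprod(Xs, f))  # least-squares fit to f
      s2 <- sigma2_s + mean((f - Xs %*% bp)^2)      # discrepancy inflates the noise
      list(beta = bp, sigma2 = s2)
    }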
39 Example. [Figure: marginal posterior intervals for the full model and for a projected model (with variables ordered by relevance); rstanarm + projpred + bayesplot.]
40 Projection: how to choose variables. How to choose which variables to set to zero? It is usually not possible to go through all covariate combinations, so use forward/backward search or L1-norm search.
41 Projection: how to choose variables. How to choose which variables to set to zero? Only k features have nonzero coefficients (‖β‖₀ = k). Optimization problem: θ_s⊥ = arg min_θ̂ (1/n) Σ_{i=1}^n KL( p(ỹ_i | θ_s) ‖ p(ỹ_i | θ̂) ), s.t. ‖β‖₀ = k. It is usually not possible to go through all covariate combinations with ‖β‖₀ = k, so use forward/backward search or the L1 norm.
42 Projective feature selection in practice. Finding the optimal projection with k nonzeros (‖β‖₀ = k) is in practice infeasible. A good heuristic is to replace the L0 constraint with an L1 constraint (as in LASSO). In this case the procedure becomes equivalent to LASSO, but with y_i replaced by ȳ_i = E[ỹ_i | D] ≈ (1/S) Σ_s β_s^T x_i (not obvious; derivation omitted). E.g. for the Gaussian linear model: β̂ = arg min_β Σ_{i=1}^n (ȳ_i − β^T x_i)² + λ ‖β‖₁. Compare to LASSO: β̂ = arg min_β Σ_{i=1}^n (y_i − β^T x_i)² + λ ‖β‖₁.
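The ordering can therefore be obtained by running any LASSO solver on the full model's mean fit; a sketch with glmnet (fit_hs as before, x the covariate matrix used in the fit):

    library(glmnet)
    ybar <- colMeans(posterior_linpred(fit_hs))  # approx. (1/S) sum_s beta_s^T x_i
    path <- glmnet(x, ybar)                      # LASSO on the full model's fit
    # order of entry along the path gives a relevance ordering of the variables
    entry <- apply(as.matrix(path$beta) != 0, 1, function(r) which(r)[1])
    sort(entry, na.last = NA)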
43 Projection, more details. We use the L1 penalization only for ordering the variables. Once the variables are ordered, we do the projection for the selected model(s) without the penalty term. After the selection, we do the projection by draws, i.e., each θ_s is projected individually to its corresponding value θ_s⊥ with the selected feature combination. The noise variance σ² is projected via the same minimum-KL principle (equations omitted).
44 Selection induced bias in variable selection. [Figure: selection induced bias for increasing training set size n; methods: CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
45 Selection induced bias in variable selection. [Figure: selection induced bias for increasing training set size n; methods: CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
46 Selection induced bias in variable selection. [Figure: results on the real datasets Ionosphere, Sonar, Ovarian, and Colon; methods: CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, BMA-proj. Piironen & Vehtari (2017).]
47 Simulated example. n = 80, p = 200, only 7 features are relevant. [Figure: test MSE vs. number of features along the Lasso path as λ is varied; optimal model size by cross-validation (dotted).]
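A sketch reproducing this kind of setup with glmnet (the dimensions follow the slide as reconstructed; treat them as illustrative):

    set.seed(1)
    n <- 80; p <- 200; k <- 7                # only the first k features relevant
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(1, k), rep(0, p - k))
    y <- drop(x %*% beta) + rnorm(n)
    library(glmnet)
    cvfit <- cv.glmnet(x, y)                 # lambda chosen by cross-validation
    sum(coef(cvfit, s = "lambda.min") != 0)  # typically selects far more than k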
48 Simulated example. [Figure: LASSO regression coefficients β_j when λ is selected by cross-validation (42 features selected).]
49 Simulated example. [Figure: test MSE vs. number of features; Lasso path (black).]
50 Simulated example. [Figure: test MSE vs. number of features; Lasso path (black), full model Bayes with horseshoe prior (dashed).]
51 Simulated example. [Figure: test MSE vs. number of features; Lasso path (black), full model Bayes with horseshoe prior (dashed), and the L1 projection (blue).]
52 Simulated example. [Figure: LASSO regression coefficients β_j.]
53 Simulated example. [Figure: full Bayes regression coefficients β_j (with horseshoe prior).]
54 Simulated example. [Figure: projected regression coefficients β_j.]
55 Projection for Gaussian Processes. For Gaussian processes, integrate over the latent values and project only the hyperparameters, using equations given by Piironen, J., and Vehtari, A. (2016). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on. [Figure: MLPD vs. number of variables for Boston Housing (D = 13), Automobile (D = 38), and Crime (D = 102); full model, ARD, and projection.]
56 Projection, pros and cons. Pros: fast to compute; reliable, often performs favorably in comparison to other methods (Piironen and Vehtari, 2017b, Comparison of Bayesian predictive methods for model selection). Cons: requires constructing the full model first. Software: R package projpred, designed to be compatible with rstanarm.
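A sketch of the projpred workflow on top of an rstanarm fit (function names follow recent projpred releases and may differ between versions):

    library(projpred)
    vs <- cv_varsel(fit_hs)           # search + cross-validated performance per size
    plot(vs, stats = "elpd")          # predictive performance vs. number of terms
    k <- suggest_size(vs)             # smallest size near full-model performance
    prj <- project(vs, nterms = k)    # project the posterior onto the selected terms
    proj_predict(prj, newdata = dat)  # predictions retaining the projected uncertainty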