Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine


Bayesian Inference: Principles and Practice
3. Sparse Bayesian Models and the Relevance Vector Machine
Mike Tipping, Cambridge, UK

Lecture 3: Overview
- A hierarchical prior for sparse linear models
- Sparse regression: (approximate) inference procedure
- Insight into sparsity
- Sparse classification: a further approximation
- Contrast with the support vector machine
We won't cover all the material in the notes in this lecture...

Bayes and Contemporary Machine Learning
- We've seen that marginalisation is a valuable component of the Bayesian data modelling paradigm
- We also saw that full Bayesian inference is often analytically intractable, although approximations for simple linear models could be very effective
- Historically, interest in Bayesian machine learning (but not statistics!) has focussed on approximations for non-linear models, e.g. neural networks:
  - The evidence procedure [MacKay]
  - Hybrid Monte Carlo sampling [Neal]
- Good news! Flexible linear kernel methods have become very fashionable, thanks mainly to the support vector machine

Linear Models and Sparsity
- Much interest in linear models has focused on sparse learning algorithms, which set many weights $w_m$ to zero in the function $y(\mathbf{x}) = \sum_m w_m \phi_m(\mathbf{x})$
- Sparsity offers:
  - Elegant complexity control
  - Feature extraction
  - Elucidation
  - Speed and compactness...

How To Encode Sparsity?
- In the algorithm: heuristic, greedy, sequential techniques (e.g. "matching pursuit")
- In the objective function: $E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$, usually via the penalty term $E_W(\mathbf{w})$, but can be via the data error term $E_D(\mathbf{w})$ (e.g. the SVM)
- In the prior: $p(\mathbf{w}\,|\,\mathbf{t}) \propto p(\mathbf{t}\,|\,\mathbf{w})\, p(\mathbf{w})$
- Recall: correspondence between $E_W(\mathbf{w})$ and $-\log p(\mathbf{w})$

Penalty Terms & Priors
- Most common regularisation term: $E_W(\mathbf{w}) = \sum_{m=1}^{M} w_m^2$
  - Corresponds to a Gaussian prior $p(\mathbf{w}) \propto \exp\big(-\sum_m w_m^2\big)$
  - Easy to work with and an effective regulariser, but no sparsity pressure
- The "correct" term would be $E_W(\mathbf{w}) = \sum_m |w_m|^0$, but this is very ugly to work with!
- Instead, $E_W(\mathbf{w}) = \sum_m |w_m|^1$ is a workable compromise which gives reasonable sparsity and reasonable tractability:
  - LP-machines
  - Basis pursuit
  - Bayesian "pruning", as a Laplace prior $p(\mathbf{w}) \propto \exp\big(-\sum_m |w_m|\big)$

A Sparse Bayesian Prior
- In fact, we can obtain sparsity by retaining the traditional Gaussian prior: this is great news for tractability
- Furthermore, this approach leads to scale invariance
- The modification to our earlier Gaussian prior is subtle:
  $$p(\mathbf{w}\,|\,\alpha_1,\dots,\alpha_M) = (2\pi)^{-M/2} \prod_{m=1}^{M} \alpha_m^{1/2}\, \exp\Big(-\tfrac{1}{2}\sum_{m=1}^{M} \alpha_m w_m^2\Big)$$
- What's new? We now have M hyperparameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_M)$, one $\alpha_m$ independently controlling the (inverse) variance of each weight $w_m$
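
As a concrete illustration (not part of the original notes), a minimal numpy sketch of this prior's log-density; the function name and example values are purely illustrative:

```python
import numpy as np

def log_sparse_gaussian_prior(w, alpha):
    """log p(w | alpha) for independent zero-mean Gaussians,
    one inverse variance alpha_m per weight w_m."""
    w, alpha = np.asarray(w, dtype=float), np.asarray(alpha, dtype=float)
    return 0.5 * np.sum(np.log(alpha) - np.log(2.0 * np.pi) - alpha * w**2)

# A large alpha_m pins its weight near zero; a small alpha_m leaves it broad.
print(log_sparse_gaussian_prior([0.0, 0.5], alpha=[1e6, 1.0]))
```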

Hierarchical Priors
- The prior $p(\mathbf{w}\,|\,\boldsymbol{\alpha})$ is Gaussian, but conditioned on $\boldsymbol{\alpha}$
- We should now define hyperpriors over all $\alpha_m$
- Previously, we utilised a log-uniform hyperprior: this is a special case of a more general Gamma hyperprior
- This gives a hierarchical prior: we must marginalise to see its true form:
  $$p(w_m) = \int p(w_m\,|\,\alpha_m)\, p(\alpha_m)\, d\alpha_m$$
- For a Gamma $p(\alpha_m)$, $p(w_m)$ is a Student-t distribution
- Its equivalent as a penalty function would be $\sum_m \log|w_m|$
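
The Student-t connection can be checked numerically. The sketch below (not from the notes; it assumes scipy is available and uses illustrative Gamma shape/rate values) marginalises a single $\alpha_m$ by quadrature and compares against the closed-form Student-t density:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma, norm, t as student_t

# Illustrative Gamma hyperprior: shape a, rate b (so mean a/b)
a_shape, a_rate = 2.0, 2.0

def marginal_pdf(w):
    """p(w) = integral of N(w | 0, 1/alpha) Gamma(alpha | a, b) d alpha."""
    integrand = lambda a: norm.pdf(w, scale=a**-0.5) * gamma.pdf(a, a_shape, scale=1.0 / a_rate)
    return quad(integrand, 0.0, np.inf)[0]

# Closed form: Student-t with df = 2a and scale = sqrt(b/a)
w = 1.3
print(marginal_pdf(w), student_t.pdf(w, df=2 * a_shape, scale=np.sqrt(a_rate / a_shape)))
```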

Priors Compared
- Specifying independent hyperparameters $\alpha_m$ is the key to sparsity
- Example marginal priors $p(w_1, w_2)$ illustrated below:
  [Figure: three contour plots comparing a Gaussian prior, the marginal prior with a single shared α, and the marginal prior with independent α]

A Sparse Bayesian Model for Regression
- As before, independent Gaussian noise: $t_n \sim N\big(y(\mathbf{x}_n; \mathbf{w}), \sigma^2\big)$
- This gives a corresponding likelihood:
  $$p(\mathbf{t}\,|\,\mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\Big(-\frac{1}{2\sigma^2}\,\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2\Big)$$
  where $\mathbf{t} = (t_1,\dots,t_N)^T$, $\mathbf{w} = (w_1,\dots,w_M)^T$, and $\boldsymbol{\Phi}$ is the N × M design matrix with $\Phi_{nm} = \phi_m(\mathbf{x}_n)$
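
A short sketch of this model, assuming Gaussian basis functions of illustrative width r (the helper names are not from the notes):

```python
import numpy as np

def design_matrix(X, centres, r=1.0):
    """N x M design matrix of Gaussian basis functions
    phi_m(x) = exp(-||x - c_m||^2 / r^2)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r**2)

def log_likelihood(t, Phi, w, sigma2):
    """log p(t | w, sigma^2) under the independent Gaussian noise model."""
    N = len(t)
    resid = t - Phi @ w
    return -0.5 * (N * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2)
```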

Approximate Inference
- Desire posterior over all unknowns:
  $$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2\,|\,\mathbf{t}) = \frac{p(\mathbf{t}\,|\,\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)}{p(\mathbf{t})}$$
- Can't compute analytically! So as previously, decompose as:
  $$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2\,|\,\mathbf{t}) = p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\; p(\boldsymbol{\alpha}, \sigma^2\,|\,\mathbf{t})$$
- $p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha}, \sigma^2)$, the weight posterior distribution, is tractable
- $p(\boldsymbol{\alpha}, \sigma^2\,|\,\mathbf{t})$ must be approximated

The Weight Posterior Term
- Given the data, the posterior distribution over weights is Gaussian:
  $$p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha}, \sigma^2) = \frac{p(\mathbf{t}\,|\,\mathbf{w}, \sigma^2)\, p(\mathbf{w}\,|\,\boldsymbol{\alpha})}{p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)} = (2\pi)^{-M/2}\,|\boldsymbol{\Sigma}|^{-1/2} \exp\Big(-\tfrac{1}{2}(\mathbf{w}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w}-\boldsymbol{\mu})\Big)$$
  with
  $$\boldsymbol{\Sigma} = \big(\sigma^{-2}\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbf{A}\big)^{-1}, \qquad \boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{A} = \mathrm{diag}(\alpha_1,\dots,\alpha_M)$$
- Key point: if $\alpha_m \to \infty$, the corresponding $\mu_m \to 0$
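
These posterior statistics translate directly into a few lines of numpy; this sketch (names illustrative) assumes Φ, t, α and σ² are given:

```python
import numpy as np

def weight_posterior(Phi, t, alpha, sigma2):
    """Posterior p(w | t, alpha, sigma^2) = N(mu, Sigma), with
    Sigma = (Phi^T Phi / sigma^2 + A)^-1 and mu = Sigma Phi^T t / sigma^2."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma
```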

The Hyperparameter Posterior Term
- As before, for uniform hyperpriors over $\log\boldsymbol{\alpha}$ and $\log\sigma$, $p(\boldsymbol{\alpha}, \sigma^2\,|\,\mathbf{t}) \propto p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)$
- Integrate out the weights to obtain the marginal likelihood:
  $$p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2) = \int p(\mathbf{t}\,|\,\mathbf{w}, \sigma^2)\, p(\mathbf{w}\,|\,\boldsymbol{\alpha})\, d\mathbf{w} = (2\pi)^{-N/2}\,\big|\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\big|^{-1/2} \exp\Big(-\tfrac{1}{2}\,\mathbf{t}^T\big(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\big)^{-1}\mathbf{t}\Big)$$
- We maximise $p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)$ to find $\boldsymbol{\alpha}_{MP}$ and $\sigma^2_{MP}$ ("type-II maximum likelihood")
- Previously, we found the single $\alpha_{MP}$ empirically; for multiple hyperparameters, we must optimise $p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)$ directly
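
A sketch of the log marginal likelihood, computed through the N × N covariance $\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$ (function name illustrative):

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """log p(t | alpha, sigma^2) = log N(t | 0, C), C = sigma^2 I + Phi A^-1 Phi^T."""
    N = len(t)
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T      # Phi A^-1 Phi^T via column scaling
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```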

Hyperparameter Re-estimation
- Differentiating $\log p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)$ gives iterative re-estimation formulae:
  $$\alpha_i^{\text{new}} = \frac{\gamma_i}{\mu_i^2}, \qquad (\sigma^2)^{\text{new}} = \frac{\|\mathbf{t} - \boldsymbol{\Phi}\boldsymbol{\mu}\|^2}{N - \sum_i \gamma_i}$$
- For convenience we have defined the quantities $\gamma_i \equiv 1 - \alpha_i \Sigma_{ii}$
- $\gamma_i \in [0, 1]$ is a measure of the "well-determinedness" of parameter $w_i$
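
The re-estimation step as a numpy sketch; the small jitter guarding against division by zero is an implementation choice, not part of the formulae:

```python
import numpy as np

def reestimate(mu, Sigma, alpha, Phi, t):
    """One re-estimation step: gamma_i = 1 - alpha_i * Sigma_ii,
    alpha_i <- gamma_i / mu_i^2, sigma^2 <- ||t - Phi mu||^2 / (N - sum_i gamma_i)."""
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha_new = gamma / (mu**2 + 1e-12)          # jitter avoids division by zero
    sigma2_new = np.sum((t - Phi @ mu) ** 2) / (len(t) - gamma.sum())
    return alpha_new, sigma2_new, gamma
```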

Summary of Inference Procedure
➊ Initialise $\alpha_i$ and $\sigma^2$ (or fix the latter if known)
➋ Compute the weight posterior sufficient statistics $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$
➌ Compute all $\gamma_i$, then re-estimate $\alpha_i$ (and $\sigma^2$ if desired)
➍ Repeat from ➋ until convergence
➎ Make predictions for new data via the predictive distribution:
  $$p(t_*\,|\,\mathbf{t}) \approx \int p(t_*\,|\,\mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha}_{MP}, \sigma^2_{MP})\, d\mathbf{w}$$
  the mean of which is $y(\mathbf{x}_*; \boldsymbol{\mu}_{MP})$
➏ Can delete basis functions for which the optimal $\alpha_i = \infty$, since $\mu_i = 0$
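
Putting steps ➊ to ➏ together, a minimal sketch of the whole loop; the initialisation, iteration count and pruning threshold are illustrative choices:

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=500, alpha_max=1e9):
    """Minimal sketch of the procedure above for the regression model."""
    N = len(t)
    alpha = np.ones(Phi.shape[1])               # step 1: initialise alpha_i and sigma^2
    sigma2 = 0.1 * np.var(t) + 1e-6
    keep = np.arange(Phi.shape[1])              # indices of surviving basis functions
    for _ in range(n_iter):
        # step 2: weight posterior statistics Sigma and mu
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
        mu = Sigma @ Phi.T @ t / sigma2
        # step 3: gamma_i = 1 - alpha_i Sigma_ii, then re-estimate alpha and sigma^2
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (mu**2 + 1e-12)
        sigma2 = np.sum((t - Phi @ mu) ** 2) / max(N - gamma.sum(), 1e-6)
        # step 6: delete basis functions whose alpha_i has effectively diverged
        relevant = alpha < alpha_max
        Phi, alpha, keep = Phi[:, relevant], alpha[relevant], keep[relevant]
    # final posterior statistics for the retained basis functions
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma, alpha, sigma2, keep
```

Predictions for a new input then use only the retained columns: the predictive mean is the new design row times the returned mean weights (step ➎).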

Sparsity: Algorithmic Insight
- ML(w): conventional maximum likelihood (or least-squares) optimises the M + 1 parameters $\mathbf{w}$ and $\sigma^2$; since $\sigma^2 \to 0$, $p(\mathbf{t}\,|\,\mathbf{w}, \sigma^2)$ is a $\delta$-function located at $\mathbf{t}$: no Ockham penalty!
- ML(α): maximising the marginal likelihood optimises the M + 1 parameters $\boldsymbol{\alpha}$ and $\sigma^2$, and $p(\mathbf{t}\,|\,\boldsymbol{\alpha}, \sigma^2)$ is a zero-mean Gaussian process with covariance
  $$\mathbf{C} = \sum_{m=1}^{M} \alpha_m^{-1}\, \boldsymbol{\phi}_m \boldsymbol{\phi}_m^T + \sigma^2\mathbf{I}, \qquad \boldsymbol{\phi}_m = \big[\phi_m(\mathbf{x}_1), \dots, \phi_m(\mathbf{x}_N)\big]^T$$


[Figure: support vector regression vs. relevance vector regression on the noisy sinc data (noise 0.100). SVR: max error 0.0896, RMS error 0.040, support vectors: 9, C and ε found by cross-validation. RVR: max error 0.0664, RMS error 0.03, relevance vectors: 7, noise estimate: 0.107.]

Sparse Bayesian Classification
- The likelihood is now Bernoulli, rather than Gaussian:
  $$P(\mathbf{t}\,|\,\mathbf{w}) = \prod_{n=1}^{N} g\big(y(\mathbf{x}_n;\mathbf{w})\big)^{t_n}\,\Big[1 - g\big(y(\mathbf{x}_n;\mathbf{w})\big)\Big]^{1-t_n}$$
  with $g(y) = 1/(1 + e^{-y})$ the logistic sigmoid function
- Note: no noise variance $\sigma^2$
- Same sparse prior $p(\mathbf{w}\,|\,\boldsymbol{\alpha})$ as for regression
- Unlike for regression and the Gaussian likelihood model, $p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha})$ cannot be obtained analytically, so we utilise a Laplace (Gaussian) approximation

Classification: Gaussian Posterior Approximation
- For the current values of $\boldsymbol{\alpha}$, find the posterior mode $\mathbf{w}_{MP}$ via optimisation
- Compute the Hessian at $\mathbf{w}_{MP}$:
  $$\nabla_{\mathbf{w}}\nabla_{\mathbf{w}} \log p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha})\,\big|_{\mathbf{w}_{MP}} = -\big(\boldsymbol{\Phi}^T\mathbf{B}\boldsymbol{\Phi} + \mathbf{A}\big)$$
  where $\mathbf{B}$ is diagonal with $B_{nn} = g\big(y(\mathbf{x}_n;\mathbf{w}_{MP})\big)\Big[1 - g\big(y(\mathbf{x}_n;\mathbf{w}_{MP})\big)\Big]$
- Negate and invert to give the covariance $\boldsymbol{\Sigma}$ for a Gaussian approximation $p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha}) \approx N(\mathbf{w}_{MP}, \boldsymbol{\Sigma})$
- Hyperparameters are updated as before using the approximated $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu} = \mathbf{w}_{MP}$
- Note that $p(\mathbf{w}\,|\,\mathbf{t}, \boldsymbol{\alpha})$ is log-concave
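
A sketch of this Laplace step, using Newton iterations to find $\mathbf{w}_{MP}$ under the Bernoulli likelihood and sparse Gaussian prior; the iteration count is an arbitrary choice and the function names are illustrative:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def laplace_posterior(Phi, t, alpha, n_newton=25):
    """Newton iterations to the posterior mode w_MP, then the negated and
    inverted Hessian (Phi^T B Phi + A)^-1 as the Gaussian covariance."""
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_newton):
        p = sigmoid(Phi @ w)
        g = Phi.T @ (t - p) - alpha * w           # gradient of the log posterior
        B = p * (1.0 - p)                         # diagonal of B
        H = Phi.T @ (Phi * B[:, None]) + A        # negative Hessian
        w = w + np.linalg.solve(H, g)             # Newton update
    Sigma = np.linalg.inv(H)
    return w, Sigma
```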

Support Vector Machines
- A state-of-the-art method for classification and regression
- Given a data set comprising N input vectors $\mathbf{x}_n$, the model has the form
  $$y(\mathbf{x};\mathbf{w}) = \sum_{n=1}^{N} w_n K(\mathbf{x}, \mathbf{x}_n) + w_0$$
- As many kernel functions $K(\cdot, \mathbf{x}_n)$ as examples: $M = N + 1$ parameters
- Support vector learning: minimise an objective function of the form $E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda\,E_W(\mathbf{w})$, where the regularisation term controls the size of the margin
  - gives excellent accuracy (particularly in classification)
  - as a side-effect, many $w_n$ get set to zero: the model is sparse

The Relevance Vector Machine (RVM)
- The RVM is simply a sparse Bayesian model utilising the same data-dependent kernel basis as the SVM:
  $$y(\mathbf{x};\mathbf{w}) = \sum_{n=1}^{N} w_n K(\mathbf{x}, \mathbf{x}_n) + w_0$$
- Demonstration of RVM classification...
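
Constructing the RVM's kernel design matrix is a one-liner; this sketch assumes a Gaussian kernel of illustrative width r, with the bias $w_0$ as a prepended column of ones:

```python
import numpy as np

def rbf_kernel_design(X, X_train, r=1.0):
    """N x (N_train + 1) design matrix [1, K(x, x_1), ..., K(x, x_N)]
    using a Gaussian kernel of width r."""
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / r**2)
    return np.hstack([np.ones((len(X), 1)), K])   # prepend the bias column
```

Feeding this design matrix into the regression loop sketched earlier gives relevance vector regression; the surviving columns correspond to the "relevance vectors".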

[Figure: classification demo. SVM: error = 9.48%, vectors = 44. RVM: error = 9.3%, vectors = 3.]

[Figure: second classification demo. SVM: error = 10.97%, vectors = 38. RVM: error = 10.43%, vectors = 3.]

Comparison with the SVM
- General observations:
  - RVM gives better generalisation in regression (?)
  - SVM gives better generalisation in classification (?)
  - RVM is much sparser (but the SVM is not designed to be sparse)
- There are other advantages of a Bayesian approach:
  - no nuisance parameters to set
  - posterior probabilities in classification
  - error bars in regression
  - principled method for more than two classes
  - not limited to Mercer kernels
  - potential to estimate input scale parameters and compare kernels

Choosing the Kernel
- As in the SVM, we must choose the kernel and set any associated parameter(s)
- A Bayesian could compare alternative kernels by computing the fully marginalised probabilities of the data under candidate models, e.g.:
  $$p(\mathbf{t}\,|\,K_1) = \int p(\mathbf{t}\,|\,\boldsymbol{\alpha}, K_1)\, p(\boldsymbol{\alpha}\,|\,K_1)\, d\boldsymbol{\alpha}$$
- We already know this integral isn't analytically tractable
- Approximation via sampling (as used previously) is not feasible for multiple $\alpha$
- Deterministic approximations to this integral have proved inaccurate
- But $p(\mathbf{t}\,|\,\boldsymbol{\alpha}_{MP})$ is a reasonable criterion for choosing kernels

Kernel Input Scale Parameters
- For example, consider the choice of $\eta$ in $K(\mathbf{x}, \mathbf{x}_n) = \exp\big(-\eta\,\|\mathbf{x} - \mathbf{x}_n\|^2\big)$
- SVM: cross-validation can be used to set the scale parameter
- RVM: we can optimise the marginal likelihood function with respect to $\eta$
- Furthermore, we can optimise multiple scale parameters $\eta_k$, one for each of the d input dimensions:
  $$K(\mathbf{x}, \mathbf{x}_n) = \exp\Big(-\sum_{k=1}^{d} \eta_k\,(x_k - x_{nk})^2\Big)$$
- Implementing sparsity of input variables (q.v. Gaussian process models)
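
A sketch of the multi-scale kernel with one $\eta_k$ per input dimension; in practice the $\eta_k$ would then be tuned by numerically maximising the marginal likelihood, the optimiser choice being left open here (names are illustrative):

```python
import numpy as np

def ard_rbf_kernel(X, X_train, eta):
    """K(x, x_n) = exp(-sum_k eta_k (x_k - x_nk)^2), one scale per input dimension.
    Driving some eta_k towards zero switches off the corresponding input."""
    diff2 = (X[:, None, :] - X_train[None, :, :]) ** 2
    return np.exp(-np.tensordot(diff2, np.asarray(eta), axes=([2], [0])))
```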

Impact on Regression Benchmarks
η-RVR: optimisation over both α and η

                     Test error                    # kernels
Dataset         SVR      RVR      η-RVR       SVR     RVR    η-RVR
Friedman #1     2.92     2.80     0.7         116.6   59.4   11.5
Friedman #2     4140     3505     593         110.3   6.9    3.9
Friedman #3     0.0202   0.0164   0.0119      106.5   11.5   6.4

Friedman #1: 10-dimensional input space, but the function depends only on variables 1-5.
[Figure: final η-values for the 10 input dimensions shown on the right]

Ockham's Razor and Kernel Width

Summary
- Experience appears to show that the sparse Bayesian learning procedure works highly effectively:
  - generalisation performance is typically very good
  - models are typically extremely (near-optimally?) sparse
- Challenge: obtain a reliable approximation to integrating out $\boldsymbol{\alpha}$ (for model comparison)
- In Lecture 4, we'll look at:
  - Properties of the marginal likelihood function
  - An efficient algorithm for optimising $\boldsymbol{\alpha}$
  - Extensions and applications of sparse Bayesian models

Regression Performance Illustration

                            errors              vectors
Data set               N    d    SVM     RVM     SVM     RVM
Sinc (Gaussian noise)  100  1    0.378   0.326   45.2    6.7
Sinc (Uniform noise)   100  1    0.215   0.187   44.3    7.0
Friedman #1            240  10   2.92    2.80    116.6   59.4
Friedman #2            240  4    4140    3505    110.3   6.9
Friedman #3            240  4    0.0202  0.0164  106.5   11.5
Boston Housing         481  13   8.04    7.46    142.8   39.0
Normalised Mean                  1.00    0.86    1.00    0.15

Classification Performance Illustration

                             errors             vectors
Data set        N     d      SVM     RVM        SVM     RVM
Pima Diabetes   200   8      20.1%   19.6%      109     4
U.S.P.S.        7291  256    4.4%    5.1%       2540    316
Banana          400   2      10.9%   10.8%      135.2   11.4
Breast Cancer   200   9      26.9%   29.9%      116.7   6.3
Titanic         150   3      22.1%   23.0%      93.7    65.3
Waveform        400   21     10.3%   10.9%      146.4   14.6
German          700   20     22.6%   22.2%      411.2   12.5
Image           1300  18     3.0%    3.9%       166.6   34.6
Normalised Mean              1.00    1.08       1.00    0.17