Relevance Vector Machines


LUT February 21, 2011

Outline
- Support Vector Machines
- Model / Regression
- Marginal Likelihood
- Regression
- Relevance vector machines
- Exercise

Support Vector Machines
The relevance vector machine (RVM) is a Bayesian sparse kernel technique for regression and classification. It solves some problems of the support vector machine (SVM) and is used in detection and classification tasks, e.g. detecting cancer cells or classifying DNA sequences.

Support Vector Machines (SVM)
A non-probabilistic decision machine: it returns a point estimate for regression and a binary decision for classification. Decisions are based on the function

    y(x; w) = ∑_i w_i K(x, x_i) + w_0    (1)

where K is the kernel function and w_0 is the bias. Training attempts to minimize the error while simultaneously maximizing the margin between the two classes.
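
To make (1) concrete, here is a minimal Python/NumPy sketch of the decision function with an RBF kernel; the kernel choice, the gamma value, and the example support vectors and weights are illustrative assumptions, not taken from the slides.

    import numpy as np

    def rbf_kernel(x, x_i, gamma=0.5):
        # Gaussian (RBF) kernel; gamma is an assumed example value
        return np.exp(-gamma * np.sum((x - x_i) ** 2))

    def svm_decision(x, support_vectors, weights, bias):
        # Equation (1): y(x; w) = sum_i w_i K(x, x_i) + w_0
        return sum(w_i * rbf_kernel(x, x_i)
                   for w_i, x_i in zip(weights, support_vectors)) + bias

    # Hypothetical support vectors and weights, purely for illustration
    support_vectors = [np.array([0.0, 1.0]), np.array([1.0, -1.0])]
    weights = [0.7, -0.3]
    print(svm_decision(np.array([0.5, 0.0]), support_vectors, weights, bias=0.1))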

Support Vector Machines (SVM)
[Figure: two-class SVM decision boundary y = 0 with margin boundaries y = −1 and y = 1.]

SVM Problems
- The number of required support vectors typically grows linearly with the size of the training set.
- Predictions are non-probabilistic.
- The error/margin trade-off parameter must be estimated.
- K(x, x_i) must satisfy Mercer's condition.

The Relevance Vector Machine (RVM)
Applies a Bayesian treatment to the SVM idea: a prior governed by a set of hyperparameters is associated with the model weights. The posterior distributions of the majority of the weights are peaked around zero; the training vectors associated with the non-zero weights are the relevance vectors. The RVM typically utilizes fewer kernel functions than the SVM.

The model
For a given data set of input-target pairs {x_n, t_n}, n = 1, ..., N,

    t_n = y(x_n; w) + ε_n    (2)

where the ε_n are samples from a noise process assumed to be zero-mean Gaussian with variance σ². Thus

    p(t_n | x) = N(t_n | y(x_n), σ²)    (3)

The model (cont.)
Sparsity is encoded in the prior

    p(w | α) = ∏_{i=0}^{N} N(w_i | 0, α_i⁻¹)    (4)

which is Gaussian, but conditioned on α. To complete the specification of this hierarchical prior we must define hyperpriors over all the α_m:

    p(w_m) = ∫ p(w_m | α_m) p(α_m) dα_m    (5)

Regression
The model has independent Gaussian noise, t_n ~ N(y(x_n; w), σ²), with corresponding likelihood

    p(t | w, σ²) = (2πσ²)^(−N/2) exp{ −‖t − Φw‖² / (2σ²) }    (6)

where t = (t_1, ..., t_N)ᵀ, w = (w_0, ..., w_M)ᵀ and Φ is the N×M design matrix with Φ_nm = φ_m(x_n).
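
To make the role of Φ in (6) concrete, here is a minimal NumPy sketch that builds a kernel design matrix; the leading bias column (Φ_n0 = 1, with the remaining columns given by the kernel evaluated at the training points) is an assumed convention chosen to match the N + 1 weights of the prior (4), not something stated on the slide.

    import numpy as np

    def design_matrix(X, kernel):
        # Kernel design matrix with a leading bias column (assumed convention)
        N = len(X)
        Phi = np.ones((N, N + 1))
        for n in range(N):
            for m in range(N):
                Phi[n, m + 1] = kernel(X[n], X[m])
        return Phi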

The model (cont.)
The desired posterior over all unknowns is

    p(w, α, σ² | t) = p(t | w, α, σ²) p(w, α, σ²) / p(t)    (7)

Given a new test point x*, predictions for the corresponding target t* are made via the predictive distribution

    p(t* | t) = ∫ p(t* | w, α, σ²) p(w, α, σ² | t) dw dα dσ²    (8)

But there is a problem here: these computations cannot be performed analytically, so approximations are needed.

The model (cont.)
We therefore decompose the posterior as

    p(w, α, σ² | t) = p(w | t, α, σ²) p(α, σ² | t)    (9)

The posterior distribution over the weights is then

    p(w | t, α, σ²) = p(t | w, σ²) p(w | α) / p(t | α, σ²) = N(w | μ, Σ)    (10)

where

    Σ = (σ⁻² ΦᵀΦ + A)⁻¹    (11)
    μ = σ⁻² Σ Φᵀ t    (12)

with A = diag(α_0, ..., α_N).
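
Given fixed α and σ², the posterior statistics (11) and (12) amount to a few lines of linear algebra. A minimal sketch (the function name and the use of a dense matrix inverse are my own choices, not the toolbox's):

    import numpy as np

    def weight_posterior(Phi, t, alpha, sigma2):
        # Equations (11)-(12): Sigma = (sigma^-2 Phi^T Phi + A)^-1, mu = sigma^-2 Sigma Phi^T t
        A = np.diag(alpha)                    # A = diag(alpha_0, ..., alpha_N)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
        mu = Sigma @ Phi.T @ t / sigma2
        return mu, Sigma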

Marginal Likelihood
The marginal likelihood can be written as

    p(t | α, σ²) = ∫ p(t | w, σ²) p(w | α) dw    (13)

Maximizing the marginal likelihood function is known as the type-II maximum likelihood method. We must optimize p(t | α, σ²); there are a few ways to do this.
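
The Gaussian integral in (13) has a closed form, p(t | α, σ²) = N(t | 0, C) with C = σ²I + Φ A⁻¹ Φᵀ; that expression comes from Tipping's paper rather than this slide. A sketch of evaluating the log marginal likelihood:

    import numpy as np

    def log_marginal_likelihood(Phi, t, alpha, sigma2):
        # log N(t | 0, C) with C = sigma^2 I + Phi A^-1 Phi^T
        N = len(t)
        C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T
        _, logdet = np.linalg.slogdet(C)
        return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))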

Marginal Likelihood optimization
Maximize (13) by iterative re-estimation. Differentiating log p(t | α, σ²) gives the re-estimation equations

    α_i^new = γ_i / μ_i²    (14)
    (σ²)^new = ‖t − Φμ‖² / (N − ∑_i γ_i)    (15)

where γ_i ≡ 1 − α_i Σ_ii. The quantity γ_i is a measure of how well-determined the parameter w_i is.
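
Putting (11), (12), (14) and (15) together gives the basic training loop for RVM regression. The sketch below is a simplified illustration (fixed iteration count, an assumed initial noise variance, a hypothetical pruning threshold and a small epsilon to guard against division by zero), not Tipping's reference implementation:

    import numpy as np

    def rvm_regression(Phi, t, n_iter=100, prune_threshold=1e6):
        N, M = Phi.shape
        alpha = np.ones(M)           # hyperparameters alpha_i
        sigma2 = 0.1 * np.var(t)     # assumed initial noise variance
        for _ in range(n_iter):
            # Weight posterior, equations (11)-(12)
            Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
            mu = Sigma @ Phi.T @ t / sigma2
            # Re-estimation, equations (14)-(15)
            gamma = 1.0 - alpha * np.diag(Sigma)
            alpha = gamma / (mu ** 2 + 1e-12)
            sigma2 = np.sum((t - Phi @ mu) ** 2) / (N - np.sum(gamma))
        relevant = alpha < prune_threshold   # basis functions kept as relevance vectors
        return mu, Sigma, alpha, sigma2, relevant

In practice the weights whose α_i grow without bound are pruned during the loop, which is what makes the final model sparse; the threshold above only mimics that at the end.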

RVMs for classification
The likelihood P(t | w) is now Bernoulli:

    P(t | w) = ∏_{n=1}^{N} g{y(x_n; w)}^{t_n} [1 − g{y(x_n; w)}]^{1−t_n}    (16)

with g(y) = 1/(1 + e⁻ʸ) the sigmoid function. There is no noise variance, and the same sparse prior as in regression is used. Unlike regression, the weight posterior p(w | t, α) cannot be obtained analytically; approximations are once again needed.

Gaussian posterior approximation
- Find the posterior mode w_MP for the current values of α by optimization.
- Compute the Hessian at w_MP.
- Negate and invert it to obtain the covariance of a Gaussian approximation p(w | t, α) ≈ N(w_MP, Σ).
- Update α using μ = w_MP and Σ (a code sketch of this procedure follows below).
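
A sketch of this Laplace-style procedure, using Newton iterations to find w_MP; the iteration count and the exact update form are my own choices, not taken from the slide or the toolbox:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def laplace_posterior(Phi, t, alpha, n_newton=25):
        # Newton iterations towards the posterior mode w_MP, then the
        # negated, inverted Hessian as the Gaussian covariance.
        A = np.diag(alpha)
        w = np.zeros(Phi.shape[1])
        for _ in range(n_newton):
            y = sigmoid(Phi @ w)
            grad = Phi.T @ (t - y) - A @ w          # gradient of the log posterior
            B = np.diag(y * (1.0 - y))
            H = -(Phi.T @ B @ Phi + A)              # Hessian of the log posterior
            w = w - np.linalg.solve(H, grad)        # Newton step
        y = sigmoid(Phi @ w)
        B = np.diag(y * (1.0 - y))
        Sigma = np.linalg.inv(Phi.T @ B @ Phi + A)  # covariance = -H^-1 at the mode
        return w, Sigma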

RVM Regression Example
sinc function: sinc(x) = sin(x)/x. Linear spline kernel:

    K(x_m, x_n) = 1 + x_m x_n + x_m x_n min(x_m, x_n) − ((x_m + x_n)/2) min(x_m, x_n)² + min(x_m, x_n)³/3

with ε = 0.01 and 100 uniform, noise-free samples.
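
For reference, the linear spline kernel and the 100 noise-free sinc samples can be set up as follows; the interval [-10, 10] is an assumption chosen to match the usual sinc demo, since the slide does not state it. Combined with the design_matrix and rvm_regression sketches above, this reproduces the setup of the example.

    import numpy as np

    def linear_spline_kernel(xm, xn):
        # K(x_m, x_n) = 1 + x_m x_n + x_m x_n min(x_m, x_n)
        #               - (x_m + x_n)/2 * min(x_m, x_n)^2 + min(x_m, x_n)^3 / 3
        mn = np.minimum(xm, xn)
        return 1 + xm * xn + xm * xn * mn - (xm + xn) / 2 * mn ** 2 + mn ** 3 / 3

    # 100 uniformly spaced, noise-free samples of sinc(x) = sin(x)/x
    X = np.linspace(-10, 10, 100)
    t = np.sinc(X / np.pi)   # numpy's sinc is sin(pi x)/(pi x), so rescale to get sin(x)/x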

[Figures: RVM regression results on the sinc example.]

RVM Example: Ripley's synthetic data
Gaussian kernel:

    K(x_m, x_n) = exp(−r⁻² ‖x_m − x_n‖²)

with r = 0.5.
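
The Gaussian kernel used here is a one-liner; the vectorised form below is my own sketch, with r = 0.5 as on the slide.

    import numpy as np

    def gaussian_kernel(xm, xn, r=0.5):
        # K(x_m, x_n) = exp(-r^-2 * ||x_m - x_n||^2)
        return np.exp(-np.sum((np.asarray(xm) - np.asarray(xn)) ** 2) / r ** 2)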

[Figure: RVM result on Ripley's synthetic data.]

Relevance vector machines
Sparsity: the prediction for new inputs depends on the kernel function evaluated at only a subset of the training data points. A more detailed explanation can be found in the original publication: Tipping, M., "Sparse Bayesian Learning and the Relevance Vector Machine", Journal of Machine Learning Research 1, 2001, pp. 211-244.
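
A sketch of how a prediction at a new input x* then uses only the relevance vectors; the predictive mean and variance formulas are the standard ones for a Gaussian weight posterior and are not spelled out on the slide:

    import numpy as np

    def rvm_predict(x_star, X_rel, mu_rel, Sigma_rel, sigma2, kernel, has_bias=True):
        # phi(x*): kernel evaluated at the relevance vectors, optionally with a
        # leading bias term (mu_rel is then expected to carry the bias weight first)
        phi = np.array([kernel(x_star, x_r) for x_r in X_rel])
        if has_bias:
            phi = np.concatenate(([1.0], phi))
        mean = phi @ mu_rel                   # predictive mean
        var = sigma2 + phi @ Sigma_rel @ phi  # predictive variance
        return mean, var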

Exercise
- Fetch Tipping's MATLAB toolbox for SparseBayes from http://www.vectoranomaly.com/downloads/downloads.htm.
- Try SparseBayesDemo.m with different likelihood models (Gaussian, Bernoulli, ...) and familiarize yourself with the toolbox.
- Try to replicate the results from the regression example.