Relevance Vector Machines


1 LUT February 21, 2011

2 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise

3 Support Vector Machines The relevance vector machine (RVM) is a Bayesian sparse kernel technique for regression and classification. It addresses several shortcomings of the support vector machine (SVM). It is used in detection and classification tasks, for example detecting cancer cells or classifying DNA sequences.

4 Support Vector Machines (SVM) A non-probabilistic decision machine: it returns a point estimate for regression and a binary decision for classification. Decisions are based on the function
$$y(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{N} w_i K(\mathbf{x}, \mathbf{x}_i) + w_0 \tag{1}$$
where $K$ is the kernel function and $w_0$ is the bias. Training attempts to minimize the error while simultaneously maximizing the margin between the two classes.
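A minimal sketch of how the decision function in (1) can be evaluated. The Gaussian kernel, its `width` parameter, the example weights, and the thresholding at zero are illustrative assumptions, not values from the slides.

```python
import math

def rbf_kernel(x, z, width=1.0):
    """Gaussian kernel on scalars (the kernel choice and width are assumptions)."""
    return math.exp(-((x - z) ** 2) / (width ** 2))

def decision_function(x, centres, weights, bias, kernel=rbf_kernel):
    """Evaluate y(x; w) = sum_i w_i K(x, x_i) + w_0 from Eq. (1)."""
    return sum(w * kernel(x, c) for w, c in zip(weights, centres)) + bias

# Hypothetical training inputs x_i with their weights w_i and bias w_0.
centres = [-1.0, 0.0, 2.0]
weights = [0.5, -1.0, 0.25]
bias = 0.1

y = decision_function(0.0, centres, weights, bias)
label = 1 if y >= 0 else -1   # binary decision obtained by thresholding y at zero
```

Only the training points with non-zero weight contribute to the sum, which is why the number of retained vectors determines the cost of a prediction.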

5 Support Vector Machines (SVM) [Figure: maximum-margin classifier showing the contours y = -1, y = 0 and y = 1, with the margin marked.]

6 SVM Problems The number of required support vectors typically grows linearly with the size of the training set. Predictions are non-probabilistic. The error/margin trade-off parameters must be estimated. The kernel $K(\mathbf{x}, \mathbf{x}_i)$ must satisfy Mercer's condition.

7 Model / Regression Marginal Likelihood The RVM applies a Bayesian treatment to the SVM's functional form. It places a prior over the model weights, governed by a set of hyperparameters. The posterior distributions of the majority of the weights are sharply peaked around zero. The training vectors associated with the remaining non-zero weights are the relevance vectors. The RVM typically uses fewer kernel functions than a comparable SVM.

8 The model For a given data set of input-target pairs $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$,
$$t_n = y(\mathbf{x}_n; \mathbf{w}) + \epsilon_n \tag{2}$$
where the $\epsilon_n$ are samples from a noise process assumed to be zero-mean Gaussian with variance $\sigma^2$. Thus,
$$p(t_n \mid \mathbf{x}) = \mathcal{N}(t_n \mid y(\mathbf{x}_n), \sigma^2) \tag{3}$$

9 The model (cont.) Sparsity is encoded in the prior:
$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \tag{4}$$
which is Gaussian, but conditioned on $\boldsymbol{\alpha}$. We must define hyperpriors over all $\alpha_m$ to complete the specification of the hierarchical prior:
$$p(w_m) = \int p(w_m \mid \alpha_m)\, p(\alpha_m)\, d\alpha_m \tag{5}$$

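10 Regression The model has independent Gaussian noise: $t_n \sim \mathcal{N}(y(\mathbf{x}_n; \mathbf{w}), \sigma^2)$. The corresponding likelihood is
$$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \lVert \mathbf{t} - \Phi\mathbf{w} \rVert^2 \right\} \tag{6}$$
where $\mathbf{t} = (t_1, \ldots, t_N)$, $\mathbf{w} = (w_1, \ldots, w_M)$ and $\Phi$ is the $N \times M$ design matrix with $\Phi_{nm} = \phi_m(\mathbf{x}_n)$.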

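11 The model (cont.) The desired posterior over all unknowns is
$$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)}{p(\mathbf{t})} \tag{7}$$
Given a new test point $\mathbf{x}_*$, predictions are made for the corresponding target $t_*$ in terms of the predictive distribution:
$$p(t_* \mid \mathbf{t}) = \int p(t_* \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t})\, d\mathbf{w}\, d\boldsymbol{\alpha}\, d\sigma^2 \tag{8}$$
These computations cannot be performed analytically, so approximations are needed.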

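12 The model (cont.) We decompose the posterior as
$$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\, p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) \tag{9}$$
The posterior distribution over the weights is then
$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)} = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \Sigma) \tag{10}$$
where
$$\Sigma = (\sigma^{-2} \Phi^{T} \Phi + A)^{-1} \tag{11}$$
$$\boldsymbol{\mu} = \sigma^{-2} \Sigma \Phi^{T} \mathbf{t} \tag{12}$$
and $A$ is the diagonal matrix of the hyperparameters $\alpha_i$.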
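Equations (11) and (12) translate almost directly into NumPy. The following is a sketch only: the function name `weight_posterior`, the random design matrix, and the chosen values of α and σ² are assumptions made for illustration.

```python
import numpy as np

def weight_posterior(Phi, t, alpha, sigma2):
    """Return (mu, Sigma) of N(w | mu, Sigma), following Eqs. (11) and (12)."""
    A = np.diag(alpha)                                  # A = diag(alpha_i)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)     # Eq. (11)
    mu = Sigma @ Phi.T @ t / sigma2                     # Eq. (12)
    return mu, Sigma

# Hypothetical toy problem: 20 observations, 3 basis functions.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))
w_true = np.array([1.0, 0.0, -2.0])
t = Phi @ w_true + 0.1 * rng.normal(size=20)

mu, Sigma = weight_posterior(Phi, t, alpha=np.ones(3), sigma2=0.01)
```

For larger problems a Cholesky-based solve would usually be preferred over an explicit inverse; the inverse is kept here because the diagonal of Σ is needed in the hyperparameter updates.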

13 Marginal Likelihood The marginal likelihood can be written as
$$p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2) = \int p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w} \tag{13}$$
Maximizing the marginal likelihood function is known as the type-II maximum likelihood method. We must optimize $p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$, and there are a few ways to do this.
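Because both factors in (13) are Gaussian, the integral has a closed form: t is zero-mean Gaussian with covariance C = σ²I + ΦA⁻¹Φᵀ, where A = diag(α). The slides do not state this expression, so the sketch below relies on that standard result; the function name is hypothetical.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """log p(t | alpha, sigma2) = -0.5 * (N log(2 pi) + log|C| + t^T C^{-1} t)."""
    N = Phi.shape[0]
    # Phi / alpha scales column i by 1 / alpha_i, i.e. it computes Phi A^{-1}.
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# One observation and one basis function, so C is the scalar 0.5 + 2*2/4 = 1.5.
value = log_marginal_likelihood(np.array([[2.0]]), np.array([1.0]),
                                alpha=np.array([4.0]), sigma2=0.5)
```

Tracking this quantity across iterations is a simple way to check that an optimizer for α and σ² is actually increasing the objective.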

14 Marginal Likelihood optimization Maximize (13) by iterative re-estimation. Differentiating $\log p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ gives the updates
$$\alpha_i^{\text{new}} = \frac{\gamma_i}{\mu_i^2} \tag{14}$$
$$(\sigma^2)^{\text{new}} = \frac{\lVert \mathbf{t} - \Phi\boldsymbol{\mu} \rVert^2}{N - \sum_{i=1}^{M} \gamma_i} \tag{15}$$
where we have defined $\gamma_i = 1 - \alpha_i \Sigma_{ii}$. The quantity $\gamma_i$ measures how well determined the parameter $w_i$ is by the data.
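The re-estimation loop can be sketched as follows. This is an illustrative implementation under stated assumptions, not the toolbox code: the initial values, the iteration count, the pruning threshold `prune_at`, and the toy data are all hypothetical choices. Basis functions whose α exceeds the threshold are dropped, which is how sparsity appears in practice.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=500, prune_at=1e6):
    """Iterate Eqs. (11), (12), (14), (15); prune weights whose alpha diverges."""
    N, M = Phi.shape
    active = np.arange(M)             # indices of basis functions still in the model
    alpha = np.ones(M)                # assumed initial hyperparameters
    sigma2 = 0.1 * np.var(t)          # assumed initial noise variance
    for _ in range(n_iter):
        P = Phi[:, active]
        Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[active]))
        mu = Sigma @ P.T @ t / sigma2
        gamma = 1.0 - alpha[active] * np.diag(Sigma)
        alpha[active] = gamma / mu ** 2                             # Eq. (14)
        sigma2 = np.sum((t - P @ mu) ** 2) / (N - gamma.sum())      # Eq. (15)
        active = active[alpha[active] < prune_at]
    # Posterior mean for the surviving basis functions; pruned weights stay at zero.
    P = Phi[:, active]
    Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[active]))
    w = np.zeros(M)
    w[active] = Sigma @ P.T @ t / sigma2
    return w, alpha, sigma2, active

# Hypothetical toy data: only basis functions 0 and 3 carry signal.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])
t = Phi @ w_true + 0.1 * rng.normal(size=100)

w, alpha, sigma2, active = rvm_regression(Phi, t)
```

Whether a given irrelevant basis function is pruned depends on the noise realization, so the size of the surviving set can vary from one data set to another.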

15 RVMs for classification The likelihood $P(\mathbf{t} \mid \mathbf{w})$ is now Bernoulli:
$$P(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} g\{y(\mathbf{x}_n; \mathbf{w})\}^{t_n} \left[1 - g\{y(\mathbf{x}_n; \mathbf{w})\}\right]^{1 - t_n} \tag{16}$$
with $g(y) = 1/(1 + e^{-y})$ the sigmoid function. There is no noise variance, and the sparse prior is the same as for regression. Unlike in regression, the weight posterior $p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha})$ cannot be obtained analytically, so approximations are once again needed.

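16 Gaussian posterior approximation Find the posterior mode $\mathbf{w}_{MP}$ for the current values of $\boldsymbol{\alpha}$ by optimization. Compute the Hessian of the log posterior at $\mathbf{w}_{MP}$. Negate and invert it to give the covariance $\Sigma$ of a Gaussian approximation $p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) \approx \mathcal{N}(\mathbf{w}_{MP}, \Sigma)$. The $\boldsymbol{\alpha}$ are then updated using $\boldsymbol{\mu}$ (here $\mathbf{w}_{MP}$) and $\Sigma$.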
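A sketch of this Gaussian approximation for fixed α, using plain Newton iterations to find the mode. The choice of Newton's method, the fixed iteration count, and the toy data are assumptions; the slides only say "optimization".

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def laplace_posterior(Phi, t, alpha, n_newton=25):
    """Approximate p(w | t, alpha) by N(w_MP, Sigma) for targets t in {0, 1}."""
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_newton):
        g = sigmoid(Phi @ w)
        grad = Phi.T @ (t - g) - A @ w               # gradient of the log posterior
        H = -(Phi.T @ np.diag(g * (1.0 - g)) @ Phi + A)   # Hessian of the log posterior
        w = w - np.linalg.solve(H, grad)             # Newton step towards the mode
    g = sigmoid(Phi @ w)
    # Covariance: negate the Hessian at the mode and invert it.
    Sigma = np.linalg.inv(Phi.T @ np.diag(g * (1.0 - g)) @ Phi + A)
    return w, Sigma

# Hypothetical toy data with two basis functions.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(40, 2))
t = (rng.random(40) < sigmoid(Phi @ np.array([1.0, -1.0]))).astype(float)
alpha = np.ones(2)

w_mp, Sigma = laplace_posterior(Phi, t, alpha)
```

In a full classifier this routine would sit inside the outer loop that re-estimates α, with `w_mp` playing the role of μ.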

17 RVM Regression Example Target: the sinc function, $\mathrm{sinc}(x) = \sin(x)/x$. Linear spline kernel:
$$K(x_m, x_n) = 1 + x_m x_n + x_m x_n \min(x_m, x_n) - \frac{x_m + x_n}{2} \min(x_m, x_n)^2 + \frac{\min(x_m, x_n)^3}{3}$$
with $\epsilon = 0.01$ and 100 uniformly spaced, noise-free samples.
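The data and kernel matrix for this example can be set up as in the sketch below. The sampling interval [-10, 10] is an assumption, since the slide does not state it, and the function names are hypothetical.

```python
import numpy as np

def linear_spline_kernel(xm, xn):
    """Univariate linear spline kernel from the slide; broadcasts over arrays."""
    m = np.minimum(xm, xn)
    return 1 + xm * xn + xm * xn * m - (xm + xn) / 2 * m ** 2 + m ** 3 / 3

def sinc(x):
    # np.sinc computes sin(pi x) / (pi x), so rescale to obtain sin(x) / x.
    return np.sinc(x / np.pi)

x = np.linspace(-10.0, 10.0, 100)                     # assumed interval
t = sinc(x)                                           # noise-free targets
K = linear_spline_kernel(x[:, None], x[None, :])      # 100 x 100 kernel matrix
```

Using `K` as the design matrix, optionally with an extra column of ones for the bias, gives one basis function per training point.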

18 RVM Regression Example Regression

19 RVM Regression Example Regression

20 RVM Example Ripley's synthetic data. Gaussian kernel:
$$K(\mathbf{x}_m, \mathbf{x}_n) = \exp\left(-r^{-2} \lVert \mathbf{x}_m - \mathbf{x}_n \rVert^2\right)$$
with $r = 0.5$.
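A short sketch of the Gaussian kernel with width r = 0.5 for multi-dimensional inputs. The two example points are made up for illustration and are not taken from Ripley's data set.

```python
import numpy as np

def gaussian_kernel(X, Z, r=0.5):
    """K[m, n] = exp(-||X[m] - Z[n]||^2 / r^2) for row-wise input vectors."""
    sq_dist = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / r ** 2)

# Two hypothetical 2-D inputs at distance 0.5 from each other.
X = np.array([[0.0, 0.0], [0.5, 0.0]])
K = gaussian_kernel(X, X)
```

With r = 0.5 the kernel value falls to exp(-1) at a distance of 0.5, so each basis function is fairly local.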

21 RVM Example Regression

22 Relevance vector machines Exercise Sparsity: predictions for new inputs depend on the kernel function evaluated at only a subset of the training data points. TODO More detailed explanation in the original publication: Tipping M., Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research 1, 2001, pp

23 Relevance vector machines Exercise Fetch Tipping's MATLAB toolbox for sparse Bayes from http: // Try SparseBayesDemo.m with different likelihood models (Gaussian, Bernoulli, ...) and familiarize yourself with the toolbox. Try to replicate the results from the regression example.
