Automatic Relevance Determination

Automatic Relevance Determination Elia Liitiäinen (eliitiai@cc.hut.fi) Time Series Prediction Group Adaptive Informatics Research Centre Helsinki University of Technology, Finland October 24, 2006

Introduction Automatic Relevance Determination (ARD) is a classical method based on Bayesian inference. In this presentation we show how it can be applied to Least Squares Support Vector Machines (LS-SVM). As a result we obtain a method for estimating hyperparameters and choosing inputs.

Outline 1 Least Squares Support Vector Machines 2 Bayesian Interference for Model Parameter Selection 3 Experiment 4 Conclusion 3/19

Least Squares Support Vector Machines We assume that a dataset $(y_i, x_i)_{i=1}^N$ is available. The inputs $x_i$ are in $\mathbb{R}^n$ for a finite $n$. Assume $\phi : \mathbb{R}^n \to \mathcal{H}$ is a mapping to some high- (possibly infinite-) dimensional space. We model the outputs $y$ by $y = \omega^T \phi(x) + b$. As is common, we will not give a fully rigorous mathematical analysis.

The Cost Function LS-SVM differs from SVM in the cost function: $I = \gamma E_1 + \xi E_2 = \frac{\gamma}{2}\|\omega\|^2 + \frac{\xi}{2}\sum_{i=1}^{N} e_i^2$, (1) where $e_i = y_i - \omega^T \phi(x_i) - b$. Note that we use two hyperparameters to get a Bayesian interpretation.

Optimization of the Cost The mapping $\phi$ is very hard to handle as such. The solution: we require $\phi(x)^T \phi(y) = K(x, y)$ for some kernel $K$. The kernel often contains an additional parameter. By a simple manipulation with the Lagrangian, it can be seen that the approximator becomes $y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b$ (2) with the condition $\sum_{i=1}^{N} \alpha_i = 0$ and an $L_2$-regularized cost function. Thus we have ended up with a well-known Gaussian process model. Solving for the $\alpha_i$ is elementary (this is in fact a form of RBF).
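
As a concrete illustration of Eqs. (1)-(2), here is a minimal numpy sketch of fitting and evaluating the approximator, assuming an RBF kernel whose width sigma plays the role of the additional kernel parameter. With the cost of Eq. (1), the ridge term added to the kernel matrix works out to $\gamma/\xi$; that parameterization, and all function names and defaults, are illustrative assumptions rather than the LS-SVM toolbox interface.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is the extra kernel parameter.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=1.0, xi=1.0, sigma=1.0):
    # Solve the LS-SVM dual linear system for (alpha, b).  With the cost of
    # Eq. (1), the kernel matrix is ridge-regularized by gamma / xi
    # (an assumed parameterization, not necessarily the toolbox's).
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                      # encodes the constraint sum_i alpha_i = 0
    A[1:, 0] = 1.0                      # bias column
    A[1:, 1:] = K + (gamma / xi) * np.eye(N)
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]              # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    # Eq. (2): y(x) = sum_i alpha_i K(x, x_i) + b.
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

Calling lssvm_fit followed by lssvm_predict reproduces the predictor of Eq. (2), with the constraint $\sum_i \alpha_i = 0$ enforced by the first row of the linear system.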

Bayesian Inference We skip the philosophical questions behind Bayesian methods. Automatic Relevance Determination is based on Bayesian inference on three levels. ARD is a classical method and can be applied to many other models (MLP, RBF, ...). In what follows, $\mathcal{H}$ denotes the model and $D$ the data. We assume no prior knowledge of the problem, which means that flat priors are used whenever necessary.

First Level of Inference Assume that the sample $(x_i, y_i)$ is i.i.d. Recall the cost $I = \gamma E_1 + \xi E_2$. (3) On the first level the hyperparameters $\gamma$ and $\xi$ are assumed to be fixed. We assume the prior $p(\omega) \propto \exp(-\frac{\gamma}{2}\|\omega\|^2)$. For the observations we assume $p(y_i \mid x_i, \omega, b, \xi, \mathcal{H}) \propto \exp(-\frac{\xi}{2} e_i^2)$. This is a model with a Gaussian prior and a Gaussian noise model.

Inference on the First Level With the assumptions of the previous slide, we get $p(\omega, b \mid D, \gamma, \xi, \mathcal{H}) \propto \exp(-I(D, \gamma, \xi, \omega, b))$. (4) It follows that, given the hyperparameters, maximizing the posterior $p(\omega, b \mid D, \gamma, \xi, \mathcal{H})$ is equivalent to minimizing the cost $I$.
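
Spelling out the step behind Eq. (4): with the flat prior on $b$, the Gaussian prior on $\omega$ and the Gaussian noise model of the previous slide, the negative log-posterior is (up to constants)

```latex
\begin{align*}
-\log p(\omega, b \mid D, \gamma, \xi, \mathcal{H})
  &= -\sum_{i=1}^{N} \log p(y_i \mid x_i, \omega, b, \xi, \mathcal{H})
     \;-\; \log p(\omega \mid \gamma, \mathcal{H}) \;+\; \mathrm{const} \\
  &= \frac{\xi}{2}\sum_{i=1}^{N} e_i^2 \;+\; \frac{\gamma}{2}\|\omega\|^2 \;+\; \mathrm{const}
  \;=\; I(D, \gamma, \xi, \omega, b) \;+\; \mathrm{const},
\end{align*}
```

so the MAP estimate $(\omega_{MAP}, b_{MAP})$ coincides with the minimizer of the LS-SVM cost, i.e. the solution of Eq. (2).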

Second Level of Inference On the second level we examine $p(\xi, \gamma \mid D, \mathcal{H})$. We write $p(\xi, \gamma \mid D, \mathcal{H}) \propto \int p(D \mid \omega, b, \xi, \mathcal{H})\, p(\omega, b \mid \xi, \gamma, \mathcal{H})\, p(\xi, \gamma \mid \mathcal{H})\, d\omega\, db$. We assume a non-informative prior for the hyperparameters. The integral can be solved in closed form, so no approximation is needed on the second level.

The Cost Function on the Second Level (1) Using the previously derived formula, we get $p(\xi, \gamma \mid D, \mathcal{H}) \propto \gamma^{n_f/2}\, \xi^{N/2}\, \det(H)^{-1/2} \exp(-I(\omega_{MAP}, b_{MAP}))$. (5) Here $H$ is the Hessian of the cost function and $n_f$ is the dimension of the space into which $\phi$ maps the inputs. Typically $n_f \gg 1$ and the Hessian $H$ is not available as such. However, it turns out that this is not a problem.

The Cost Function on the Second Level (2) By recalling the condition $\phi(x_i)^T \phi(x_j) = K(x_i, x_j)$, it is possible to derive a maximum likelihood cost function for the hyperparameters. To compute a value of the cost, a first-level optimization must be done together with solving for the eigenvalues of the so-called centered Gram matrix. The optimization problem is one-dimensional. The derivation in the paper can certainly be done without the SVM context.
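
The exact second-level criterion is derived in the paper; as a hedged stand-in that captures the idea, the sketch below scores $(\gamma, \xi)$ by the marginal likelihood of the equivalent Gaussian-process model from Eq. (2): a Gaussian prior on $\omega$ with precision $\gamma$, Gaussian noise with precision $\xi$, and a vague Gaussian prior on $b$ standing in for the flat prior. The vague bias prior, the grid search, and the function names are assumptions made for illustration only.

```python
import numpy as np

def neg_log_evidence(log_gamma, log_xi, K, y, sigma_b2=1e6):
    # Negative log marginal likelihood of the hyperparameters in the
    # Gaussian-process view of the model: omega ~ N(0, I/gamma), noise
    # variance 1/xi, and a vague N(0, sigma_b2) prior on b (an assumption
    # standing in for the flat prior of the slides).  Then
    # y ~ N(0, K/gamma + I/xi + sigma_b2 * 1 1^T).
    gamma, xi = np.exp(log_gamma), np.exp(log_xi)
    N = len(y)
    C = K / gamma + np.eye(N) / xi + sigma_b2 * np.ones((N, N))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * N * np.log(2 * np.pi)

def select_hyperparameters(K, y, grid=np.linspace(-6.0, 6.0, 25)):
    # Coarse grid search over (log gamma, log xi).
    best = min(((neg_log_evidence(lg, lx, K, y), lg, lx)
                for lg in grid for lx in grid), key=lambda t: t[0])
    return np.exp(best[1]), np.exp(best[2])
```

Reducing this to a one-dimensional search over the ratio $\gamma/\xi$ via the eigenvalues of the centered Gram matrix, as described above, would be more faithful to the paper; the grid merely keeps the sketch short.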

The Third Level of Inference Recall that $\mathcal{H}$ denotes the model structure (including kernel parameters and selected inputs). On the third level we write (assuming non-informative priors) $p(D \mid \mathcal{H}) = \int p(D \mid \gamma, \xi, \mathcal{H})\, p(\xi, \gamma \mid \mathcal{H})\, d\xi\, d\gamma \approx p(D \mid \gamma_{MAP}, \xi_{MAP}, \mathcal{H})\, D_\gamma\, D_\xi$. (6) The terms $D_\gamma$ and $D_\xi$ are obtained from the second derivatives of the second-level cost function at the optimum. All the approximations made are well-known.
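
A rough numerical sketch of the level-3 score in Eq. (6): the curvature of the second-level cost at $(\gamma_{MAP}, \xi_{MAP})$ is estimated by finite differences in the log-hyperparameters, and the error-bar terms are taken as the inverse square roots of those curvatures. That reading of $D_\gamma$ and $D_\xi$, the log-parameterization, and the omitted constants are assumptions; `neg_log_ev` is any two-argument level-2 cost, e.g. a wrapper around `neg_log_evidence` above with `K` and `y` fixed.

```python
import numpy as np

def log_model_evidence(neg_log_ev, lg_map, lx_map, eps=1e-3):
    # Laplace-style level-3 score in the spirit of Eq. (6), up to constants:
    # log p(D | gamma_MAP, xi_MAP, H) minus half the log-curvatures of the
    # second-level cost in (log gamma, log xi) at its optimum.  Taking the
    # error-bar terms D_gamma, D_xi as curvature**(-1/2) is an assumption.
    def second_derivative(f, x0, y0, axis):
        # Central finite difference along one coordinate.
        if axis == 0:
            return (f(x0 + eps, y0) - 2 * f(x0, y0) + f(x0 - eps, y0)) / eps ** 2
        return (f(x0, y0 + eps) - 2 * f(x0, y0) + f(x0, y0 - eps)) / eps ** 2

    j_map = neg_log_ev(lg_map, lx_map)
    curv_g = second_derivative(neg_log_ev, lg_map, lx_map, 0)
    curv_x = second_derivative(neg_log_ev, lg_map, lx_map, 1)
    # Broader hyperparameter posteriors (smaller curvature) give larger
    # D_gamma, D_xi and hence more evidence for the model structure H.
    return -j_map - 0.5 * np.log(curv_g) - 0.5 * np.log(curv_x)
```

For example, `log_model_evidence(lambda lg, lx: neg_log_evidence(lg, lx, K, y), lg_map, lx_map)` lets two candidate structures $\mathcal{H}$ be compared, each with its own kernel matrix `K`.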

Input Selection Now that we can evaluate the evidence $p(D \mid \mathcal{H})$ of models, input selection is easy. A combination of inputs is evaluated by performing all three levels of inference to calculate the kernel parameters and hyperparameters. Scaling of the input variables is implemented in the same way. All this is already done in the LS-SVM toolbox.
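
A generic greedy loop of the kind described above might look as follows; it is not the LS-SVM toolbox implementation, and `evidence_fn` is a placeholder for any per-subset evidence score such as the level-3 quantity sketched earlier.

```python
def backward_selection(X, y, evidence_fn):
    # Greedy backward input selection driven by the model evidence.
    # evidence_fn(X_subset, y) returns a (log-)evidence score, higher is
    # better; this generic loop is not the LS-SVM toolbox implementation.
    selected = list(range(X.shape[1]))
    history = [(tuple(selected), evidence_fn(X[:, selected], y))]
    while len(selected) > 1:
        # Tentatively remove each remaining input and keep the best subset.
        scores = [(evidence_fn(X[:, [j for j in selected if j != i]], y), i)
                  for i in selected]
        best_score, dropped = max(scores)
        selected.remove(dropped)
        history.append((tuple(selected), best_score))
    # Return the input combination with the highest evidence along the path.
    return max(history, key=lambda t: t[1])[0]
```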

Practical Point of View The method seems too heavy for many applications (this holds for LS-SVM in general). In the course we will use second-level inference. An alternative to ARD is leave-one-out (LOO) cross-validation; the two can also be used together. LOO has a computational cost of the same order.

Experiment We examine a linear model with ten Gaussian inputs and Gaussian noise. Backward selection with ARD is performed.
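
For reproducing the setting, a data generator in the spirit of the experiment could look like the following; the sample size, the number of relevant inputs, and the coefficients are not specified in the slides and are purely illustrative assumptions.

```python
import numpy as np

def make_experiment_data(N=200, noise_fraction=0.15, n_relevant=5, seed=0):
    # Linear model with ten Gaussian inputs and Gaussian noise whose standard
    # deviation is a fraction of the output's (one reading of "noise 15% of
    # the output").  N, n_relevant and the coefficients are illustrative
    # assumptions, not taken from the slides.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, 10))
    w = np.zeros(10)
    w[:n_relevant] = rng.uniform(1.0, 2.0, size=n_relevant)  # hypothetical weights
    y_clean = X @ w
    noise = noise_fraction * y_clean.std() * rng.standard_normal(N)
    return X, y_clean + noise
```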

Results The first experiment was done with a noise level of 15% of the output, the second with a noise level of 30%. The first experiment was solved optimally; the results of the second are shown in the figure.

Results (2) Figure: The best possible MSE (y-axis) as a function of backward-selection steps (x-axis) for optimal inputs, ARD-chosen inputs, and LARS-chosen inputs.

Conclusion ARD is a classical method for choosing model parameters. In this presentation we showed how to use it for LS-SVM. The resulting algorithm is computationally heavy but fully automatic. In the project work we will combine ARD with a grid search for hyperparameter estimation.