Bayesian Support Vector Machines for Feature Ranking and Selection


Bayesian Support Vector Machines for Feature Ranking and Selection. Paper by Chu, Keerthi, Ong, and Ghahramani. Presented by Patrick Pletscher (pat@student.ethz.ch), ETH Zurich, Switzerland, 12th January 2006.

Overview 1 Introduction 2 SVM & Bayesian Framework 3 Feature Selection 4 Results

Our Task: Feature Selection. We want to learn the mapping function f between given data (feature vectors) and target values, so that we can predict the target value for unseen data. [Diagram: features are mapped by f to classes.] As a sub-problem we first want to find the relevant features, so that the actual learning runs on a smaller data set. Advantages: reduced computational load and/or increased accuracy.

Before Christmas: Embedded Methods. Guided search: in embedded methods, the search is guided by a learning process. Two advantages compared to wrappers: less computationally expensive and less prone to overfitting. Feature ranking/selection with BSVMs is also an embedded method: the ranking is done with a BSVM classifier, while for the selection step any learning algorithm can be used.

Support Vector Machines (1/3). Figure: Schematic illustration of a simple (linear) support vector machine.

Support Vector Machines (2/3). What you should already know. Sparseness: only the support vectors enter the classification rule, and there are usually far fewer support vectors than data points. The great power of support vector machines comes from the kernel trick: we map our features to some high-dimensional feature space and compute the discrimination function there:

$$f_{\mathrm{SVM}}(x) = \sum_{i=1}^{N} \alpha_i y_i\, k(x, x_i) + \beta_0.$$
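To make the discrimination function concrete, here is a minimal sketch in Python (not from the paper; the RBF kernel, support vectors and coefficients below are made-up illustrations):

```python
# Minimal sketch of f_SVM(x) = sum_i alpha_i * y_i * k(x, x_i) + beta_0.
# alphas, support vectors, labels and beta0 would come from a trained SVM;
# the values used here are hypothetical.
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def f_svm(x, support_vectors, alphas, labels, beta0, kernel=rbf_kernel):
    """Evaluate the kernel SVM discrimination function at a point x."""
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + beta0

# Toy usage with made-up support vectors:
svs = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.sign(f_svm(np.array([0.2, 0.9]), svs, [0.7, 0.7], [+1, -1], 0.1)))
```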

Support Vector Machines (3/3) Figure: Radial basis support vector machine

Reproducing Kernel Hilbert Space (1/2). Kernel trick, some questions: What does this high-dimensional space look like? Could one define regularization coefficients in this high-dimensional space? Both are answered by the theory of the Reproducing Kernel Hilbert Space (RKHS), denoted by H throughout the talk. RKHS: a smooth Hilbert space. We don't want non-smooth discrimination functions f. A Hilbert space provides us with all the properties we want, except for the smoothness. Idea: restrict the space of possible functions.

Reproducing Kernel Hilbert Space (2/2). Given some kernel $k(x, x')$, we construct a Hilbert space such that $k$ is a dot product in that space. Gram matrix: given data $x_1, \ldots, x_m$, define $K_{ij} = k(x_i, x_j)$. Positive definite $K$: we get a good Hilbert space if the matrix representation of our kernel is positive definite. Mercer's theorem provides a recipe for constructing such kernels. The Reproducing Kernel Hilbert Space $\mathcal{H}$ is the vector space containing all linear combinations of the functions $k(\cdot, x)$:

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i).$$
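A short sketch of how one could build the Gram matrix and check positive semi-definiteness numerically (illustrative only; the linear kernel and the random data are assumptions, not from the slides):

```python
# Build the Gram matrix K_ij = k(x_i, x_j) and verify numerically that it is
# positive semi-definite, as required for k to act as a dot product in the RKHS.
import numpy as np

def gram_matrix(X, kernel):
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(5, 3)                      # toy data: 5 points, 3 features
linear = lambda x, z: float(np.dot(x, z))      # linear kernel as an example
K = gram_matrix(X, linear)
eigvals = np.linalg.eigvalsh(K)                # symmetric eigenvalue routine
print("positive semi-definite:", np.all(eigvals >= -1e-10))
```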

Regularized risk for RKHS. Just like in standard linear SVMs we try to minimize

$$\min_{f \in \mathcal{H}} R(f) = \underbrace{\sum_{i=1}^{m} l(y_i, f(x_i))}_{\text{loss}} + \underbrace{\frac{1}{C} \|f\|_{\mathcal{H}}^2}_{\text{regularizer}}$$

where $\|\cdot\|_{\mathcal{H}}$ is a norm in $\mathcal{H}$. For a linear SVM this regularizer is $\|w\|^2$, the squared 2-norm of the parameters that determine the hyperplane. Notice: the kernel is still present implicitly in both the loss and the regularizer term.

Applying Bayesian learning to SVMs (1/2). Bayesian philosophy: the state of nature is modelled as a random variable. For our problem, the mapping function f is the state of nature. Double random process: 1. the function f is drawn from a family of functions (given by $\mathcal{H}$) according to a distribution P(f); 2. the data are drawn from P(D|f). One problem with this approach: what is the correct P(f)?

Applying Bayesian learning to SVMs (2/2). Bayes' theorem:

$$\underbrace{P(f \mid D)}_{\text{posterior}} \;\propto\; \underbrace{P(D \mid f)}_{\text{likelihood}}\; \underbrace{P(f)}_{\text{prior}}$$

Bayesian SVMs combine SVMs with Bayesian theory. We need to define the prior P(f) and the likelihood P(D|f). Bayesian philosophy: $f = \{f(x_i)\}_{i=1,\ldots,m}$ is now treated as a random variable.

Combine Bayes' theorem with regularized risk: likelihood. Empirical risk: we are given a term of the form

$$R_{\mathrm{emp}} = \sum_{i=1}^{m} l(y_i, f(x_i)),$$

which measures the empirical risk (loss on the training examples). Maximum likelihood: adopt the maximum likelihood philosophy and favour mapping functions with small error on our training sample. Introduce probabilities: small probabilities for large empirical risks, high probabilities for small empirical risks. One comes up with

$$P(D \mid f) \propto \exp\Bigl(-\sum_{i=1}^{m} l(y_i, f(x_i))\Bigr).$$

Combine Bayes' theorem with regularized risk: prior. Gaussian process: apply the concept of a Gaussian process. We want to favour small values of the regularizer:

$$P(f) \propto \exp\Bigl(-\frac{1}{C} \|f\|_{\mathcal{H}}^2\Bigr)$$

Such a regularizer corresponds to weight decay, which we have already seen in previous lectures. Such a prior is called Automatic Relevance Determination (ARD). Why do we use Gaussians and exponential functions again and again? Well-known theory from statistical physics: compare to the Boltzmann distribution for energy. If one uses a Gaussian for the prior, it makes sense to use an exponential function for the likelihood as well, since the posterior then becomes easier to handle.

Putting it all together. The prior for the mapping function has the form

$$P(f) \propto \exp\Bigl(-\frac{1}{C} \|f\|_{\mathcal{H}}^2\Bigr)$$

and the likelihood

$$P(D \mid f) \propto \exp\Bigl(-\sum_{i=1}^{m} l(y_i, f(x_i))\Bigr).$$

The minimizer of the regularized functional can then be interpreted as the MAP (maximum a posteriori) estimate:

$$\min_{f \in \mathcal{H}} R(f) \quad \text{is equivalent to} \quad \max_{f \in \mathcal{H}} P(D \mid f)\, P(f).$$
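Taking negative logarithms makes the equivalence explicit (a short derivation, assuming the normalization constants do not depend on f):

$$-\ln\bigl(P(D \mid f)\, P(f)\bigr) = \sum_{i=1}^{m} l(y_i, f(x_i)) + \frac{1}{C} \|f\|_{\mathcal{H}}^2 + \text{const} = R(f) + \text{const},$$

so maximizing the posterior over $f$ is the same as minimizing the regularized risk $R(f)$.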

Regularizers for Bayesian SVMs (1/3). 2-norm regularization: for the linear SVM we used $\|w\|$ as the regularizer (see Lecture 7). As a short reminder, $w$ contains the parameters of the hyperplane:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i.$$

Regularization of a function f? How should one define a norm that can be used for the functions in the RKHS? What we want: we don't want overly complicated functions (one reason: Vapnik-Chervonenkis theory), and the kernel used for the RKHS should influence the regularizer.

Regularizers for Bayesian SVMs (2/3). Take a covariance matrix as the regularizer! Compare to the linear SVM (for n features):

$$\|w\|^2 = \Bigl(\sum_{i=1}^{m} \alpha_i y_i x_{i,1}\Bigr)^2 + \ldots + \Bigl(\sum_{i=1}^{m} \alpha_i y_i x_{i,n}\Bigr)^2$$

Feature selection, general idea: include an ARD (automatic relevance determination) parameter (compare to the scaling parameters of embedded methods), and measure the covariance between the outputs corresponding to inputs $x_i$ and $x_j$. This is the first time individual features occur in our formulas (indexed by l).

Regularizers for Bayesian SVMs (3/3). ARD linear kernel:

$$\mathrm{Cov}[f(x_i), f(x_j)] = \sum_{l=1}^{n} \kappa_{a,l}\, x_{i,l}\, x_{j,l} + \kappa_b$$

ARD Gaussian kernel:

$$\mathrm{Cov}[f(x_i), f(x_j)] = \kappa_0 \exp\Bigl(-\frac{1}{2} \sum_{l=1}^{n} \kappa_{a,l} (x_{i,l} - x_{j,l})^2\Bigr) + \kappa_b$$

Prior for the random variables $\{f(x_i)\}$: collect the parameters in $\theta$ and let $\Sigma$ denote the $m \times m$ covariance matrix, then

$$P(f \mid \theta) = \frac{1}{Z_f} \exp\Bigl(-\frac{1}{2} f^T \Sigma^{-1} f\Bigr).$$
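A minimal sketch of evaluating the ARD Gaussian covariance above in NumPy (the hyperparameter values below are arbitrary illustrations, not from the paper):

```python
# ARD Gaussian covariance from the slide:
# Cov[f(x_i), f(x_j)] = kappa_0 * exp(-0.5 * sum_l kappa_a[l]*(x_il - x_jl)^2) + kappa_b
# kappa_0, kappa_a (one value per feature) and kappa_b are the hyperparameters theta.
import numpy as np

def ard_gaussian_cov(X, kappa_0, kappa_a, kappa_b):
    """Return the m x m covariance matrix Sigma for inputs X of shape (m, n)."""
    diff = X[:, None, :] - X[None, :, :]               # (m, m, n) pairwise differences
    sq = np.einsum('ijl,l->ij', diff ** 2, kappa_a)    # feature-weighted squared distances
    return kappa_0 * np.exp(-0.5 * sq) + kappa_b

X = np.random.randn(4, 3)
Sigma = ard_gaussian_cov(X, kappa_0=1.0, kappa_a=np.array([1.0, 0.1, 0.01]), kappa_b=0.1)
print(Sigma.shape)  # (4, 4); a large kappa_a[l] means feature l strongly shapes the covariance
```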

Likelihood for Bayesian SVMs. Assume the training data are i.i.d.; the likelihood thus has the form

$$P(D \mid f, \theta) = \prod_{i=1}^{m} P(y_i \mid f(x_i)).$$

Usually $-\ln P(y_i \mid f(x_i))$ is referred to as the loss function $l(y_i, f(x_i))$ (we want a sum instead of the product). Trigonometric loss function: the authors introduce a new trigonometric loss function. Why? Because one will have to solve an optimization problem, just as in the standard SVM setting, and sparseness then helps to make the problem computationally tractable.

$$l_t(y_i, f(x_i)) = \begin{cases} +\infty & \text{if } y_i f(x_i) \in (-\infty, -1] \\ 2 \ln \sec\bigl(\tfrac{\pi}{4}(1 - y_i f(x_i))\bigr) & \text{if } y_i f(x_i) \in (-1, +1) \\ 0 & \text{if } y_i f(x_i) \in [+1, +\infty) \end{cases}$$
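A tiny sketch of this loss as a function of z = y f(x), directly following the piecewise definition above:

```python
# Trigonometric loss l_t as a function of z = y * f(x).
import numpy as np

def trig_loss(z):
    """+inf for z <= -1, 2*ln(sec(pi/4*(1-z))) for -1 < z < 1, 0 for z >= 1."""
    if z <= -1.0:
        return np.inf
    if z >= 1.0:
        return 0.0
    return 2.0 * np.log(1.0 / np.cos(np.pi / 4.0 * (1.0 - z)))

print(trig_loss(-1.0), trig_loss(0.0), trig_loss(1.0))  # inf, ~0.693 (= ln 2), 0.0
```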

Bayesian SVMs: trigonometric loss function. Figure: Trigonometric loss function introduced by Chu et al., plotted as loss against z = y f(x).

Trigonometric loss: comparison to other losses. Figure: Comparison of the trigonometric loss function introduced by Chu et al. to other losses proposed in the literature (L2 SVC, L1 SVC, 0/1 loss), plotted against z = y f(x) with the decision boundary and margin marked.

Trigonometric likelihood function $P_t(y_i \mid f(x_i))$. Figure: Likelihood implied by the trigonometric loss, plotted against z = y f(x).

Trigonometric likelihood function $P_t(y_i \mid f(x_i))$. Figure: Comparison of the likelihood implied by the trigonometric loss to the likelihood implied by the logistic loss, plotted against z = y f(x); negative z corresponds to misclassified examples, positive z to correctly classified ones.

Posterior probability and MAP. Based on Bayes' theorem, the posterior probability of f can be written as

$$P(f \mid D, \theta) = \frac{1}{Z_R} \exp(-R(f)), \quad \text{where } R(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{m} l(y_i, f(x_i)).$$

MAP estimate of the values of f: the MAP estimate is the minimizer of the optimization problem

$$\min_f R(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{m} l(y_i, f(x_i)).$$

Solving this optimization problem is similar to the one that arises in standard SVMs.
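For intuition only, a small numerical sketch of this MAP step: it minimizes R(f) over the vector of function values with a generic optimizer, using a smooth logistic loss as a stand-in for the trigonometric loss (the paper instead solves an optimization similar to the standard SVM one):

```python
# Illustrative sketch (not the paper's solver): minimize
# R(f) = 0.5 * f^T Sigma^{-1} f + sum_i l(y_i * f_i) over f = [f(x_1), ..., f(x_m)].
import numpy as np
from scipy.optimize import minimize

def regularized_risk(f, Sigma_inv, y, loss):
    return 0.5 * f @ Sigma_inv @ f + np.sum(loss(y * f))

def map_estimate(Sigma, y, loss=lambda z: np.logaddexp(0.0, -z)):
    """Return the minimizer of R(f); the logistic loss is a smooth stand-in."""
    Sigma_inv = np.linalg.inv(Sigma)
    res = minimize(regularized_risk, np.zeros(len(y)),
                   args=(Sigma_inv, y, loss), method="L-BFGS-B")
    return res.x

# Toy usage with a random positive-definite covariance and labels:
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 1e-3 * np.eye(5)
y = np.array([1, -1, 1, 1, -1], dtype=float)
print(map_estimate(Sigma, y))
```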

Hyperparameter Inference. Use Bayes' theorem again: $P(\theta \mid D) = P(D \mid \theta)\, P(\theta) / P(D)$. The technical details are rather involved, since they require solving high-dimensional integrals and some gradient-descent-like optimization; but these can be solved approximately and rather efficiently. What's important: the result depends on f! $R(f)$ from the previous slide enters the computation of $P(D \mid \theta)$, so we really use the information provided by our BSVM classifier. What have we gained? We get a parameter vector $\theta$ which contains $\kappa_{a,l}$, the relevance of feature l, for $l = 1, \ldots, n$.

Feature Ranking. Extract correctly normalized relevance variables $\{r_i\}_{i=1}^{n}$ from the ARD parameters:

$$r_i = \frac{\kappa_{a,i}}{\sum_{j=1}^{n} \kappa_{a,j}}$$

A severe problem for the inference of $\theta$: local minima. Gradient-descent-like methods can get stuck in local minima! Solution: repeat the minimization several times from random starting points and assume the underlying distribution $P(\theta \mid D)$ is a superposition of Gaussians centered at the different minima we find.
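A small sketch of the normalization and the multiple-restart idea (the kappa_a vectors below are hypothetical outputs of restarted hyperparameter optimizations, not real results):

```python
# Normalize ARD parameters into relevance variables r_i = kappa_a[i] / sum_j kappa_a[j]
# and average over several restarts to cope with local minima.
import numpy as np

def relevance(kappa_a):
    kappa_a = np.asarray(kappa_a, dtype=float)
    return kappa_a / kappa_a.sum()

# kappa_a vectors from (hypothetical) restarts of the hyperparameter optimization:
restarts = [np.array([3.0, 0.5, 0.1]), np.array([2.5, 0.8, 0.2])]
r = np.mean([relevance(k) for k in restarts], axis=0)
ranking = np.argsort(-r)             # features ordered by decreasing relevance
print(r, ranking)
```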

Posterior $P(\theta \mid D)$. Figure: Reconstructed posterior as a superposition of Gaussians, showing $g_1(x)$, $g_2(x)$ and their sum $g_1(x) + g_2(x)$.

Feature Ranking and Selection. Given some learning algorithm L (see the code sketch below):
initialize the validation error to infinity and set k = 0;
repeat
    k = k + 1;
    use the top k features as the input vector to L;
    carry out cross-validation via grid search on the model parameters;
    keep the best validation error;
until the validation error increases significantly;
return the top k-1 features as the minimal subset.
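A possible rendering of this loop in Python, with scikit-learn as the learner L (a sketch under assumptions: the ranking is taken as given from the BSVM step, "increasing significantly" is approximated by a fixed tolerance, and the parameter grid is arbitrary):

```python
# Forward selection over BSVM-ranked features with cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def select_features(X, y, ranking, tol=0.01):
    """ranking: feature indices ordered by decreasing relevance (from the BSVM step)."""
    best_err, best_k = np.inf, 0
    for k in range(1, len(ranking) + 1):
        Xk = X[:, ranking[:k]]                       # top-k ranked features
        grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
        grid.fit(Xk, y)
        err = 1.0 - grid.best_score_                 # validation error at this k
        if err < best_err:
            best_err, best_k = err, k
        elif err > best_err + tol:                   # error increasing significantly
            break
    return ranking[:best_k]                          # minimal feature subset
```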

NIPS 2003 results (1/2). Figure: Balanced error rate (BER) overall and on the Arcene, Dexter, Dorothea, Gisette and Madelon data sets.

NIPS 2003 results (2/2). Figure: Results on the Madelon data set for BSVM feature ranking (top, red) and Fisher score (bottom, blue). Left: relevance variables / Fisher scores against feature rank. Right: validation error rate against the number of top-ranked features.

Conclusion and take-home message. Conclusion: some aspects of the approach seem interesting, but I also have some doubts about the method. Nice: you get posterior probabilities rather than just class labels. It seems quite slow (only a vague impression). A lot of buzzwords: Bayesian inference, Support Vector Machines, Gaussian processes. The world formula of feature selection? Take-home message: one can create an embedded method for Support Vector Machines to do feature selection, but at the expense of quite a bit of complexity.

Appendix: Notation & Acknowledgment. Notation: the mapping function $f := [f(x_1), \ldots, f(x_m)]^T$; N denotes the number of support vectors; m denotes the number of data points; n denotes the number of features. Acknowledgment: I want to thank Isabelle Guyon for her immense support while preparing the slides of this talk.