An Analytical Comparison between Bayes Point Machines and Support Vector Machines

Ashish Kapoor
Massachusetts Institute of Technology, Cambridge, MA

Abstract

This paper analyzes the relationship and the differences between two variants of kernel machines, namely the Bayes Point Machine (BPM) and the Support Vector Machine (SVM). We pose both the BPM and the SVM as estimation problems in a probabilistic framework. Given training data and a loss function, a posterior probability distribution on the space of functions is induced. The BPM solution is shown to be the mean of this posterior, whereas the SVM is shown to be the Maximum A Posteriori (MAP) solution when using the hinge loss function.

1 Introduction

There has been a lot of research directed at kernel machines. The Support Vector Machine (SVM) [1] is inspired by statistical learning theory, whereas the Bayes Point Machine (BPM) is a Bayesian approximation to classification. The support vector machine looks at the learning problem from an optimization perspective, whereas the Bayesian perspective relies on sampling from probability distributions. Despite these differences there seems to be a close relationship between the SVM and the BPM; Herbrich et al. [2] have highlighted some of these similarities. This paper aims to pose the BPM and the SVM in one single framework and to analyze the similarities and the differences between the two approaches.

The next section provides a very quick overview of the BPM. Following that, we discuss the SVM as a special case of Tikhonov regularization and give it a probabilistic interpretation. In section 4 we pose both the SVM and the BPM in a probabilistic framework, which allows us to compare them effectively. We then conclude with future work.

We limit this discussion to two-class classification problems. We denote the data points by bold letters $\mathbf{x}$ and the corresponding class labels by $y \in \{-1, +1\}$. The training set comprises $n$ tuples, that is, $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of training samples. The classifiers are denoted by $f$ and belong to a fixed hypothesis space $\mathcal{H}$. Further, we restrict ourselves to linear kernels without bias for simplicity, so the classifiers are of the form $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$. Finally, $L(y, \hat{y})$ is a loss function that represents the loss incurred in estimating $y$ as $\hat{y}$. Though we limit ourselves here to the linear kernel, the discussion can easily be extended to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected.
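To make the notation concrete, here is a minimal sketch of the setup just described: a toy two-class training set, a linear classifier without bias $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, and the 0-1 loss. The synthetic data and all function names are assumptions made purely for illustration; nothing below comes from the paper.

```python
# Minimal sketch of the setup in section 1 (illustrative only, not from the paper):
# toy data {(x_i, y_i)}, a linear classifier without bias, and the 0-1 loss.
import numpy as np

def make_toy_data(n=40, d=2, seed=0):
    """Sample a linearly separable toy training set (an assumption for illustration)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_true)                       # labels in {-1, +1}
    return X, y

def predict(w, X):
    """Linear classifier without bias: f(x) = sign(w . x)."""
    return np.sign(X @ w)

def zero_one_loss(y, y_hat):
    """Average 0-1 loss of predicting y_hat when the truth is y."""
    return float(np.mean(y != y_hat))

X, y = make_toy_data()
w = np.ones(X.shape[1])                            # an arbitrary candidate classifier
print("training 0-1 error:", zero_one_loss(y, predict(w, X)))
```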

Figure 1: Difference between the SVM and the Bayesian classification strategy.

2 Bayes Point Machine

This section provides a very brief overview of the Bayes point machine. Readers wishing for more details should look into [2, 3].

2.1 The Bayes Classification Strategy

Given a test point $\mathbf{x}$, the Bayes classification strategy considers how $\mathbf{x}$ is classified by every possible classifier $f \in \mathcal{H}$ and weighs each vote according to the posterior $P(f|D)$. The following equation depicts the Bayesian classification strategy:

$$ f_{Bayes}(\mathbf{x}) = \mathrm{sign}\left( \int_{\mathcal{H}} f(\mathbf{x}) \, P(f|D) \, df \right) \qquad (1) $$

This classification strategy, which is the Bayesian averaging of linear classifiers, has been proven both theoretically and empirically to be optimal on average in terms of generalization performance [4, 5]. However, the Bayes classification strategy is computationally very demanding; moreover, it is only a strategy and in general may not correspond to any single classifier within the hypothesis space considered.
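The integral in equation (1) can be approximated by Monte Carlo. The sketch below draws candidate directions from a uniform prior on the unit sphere and weights their votes with a soft, hinge-based likelihood; the toy data, the softness constant C, and the choice of importance weighting are all assumptions for illustration, not the sampling schemes of [2, 3] (the likelihood forms the paper actually discusses appear in sections 2.2 and 4.2).

```python
# Monte Carlo sketch of equation (1): posterior-weighted voting of classifiers.
# The soft hinge-based likelihood exp(-C * sum of hinge losses) used for the
# weights is an assumption made only so the example is self-contained.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sign(X @ np.array([1.5, -0.8]))            # toy, linearly separable labels
C = 5.0

def neg_log_likelihood(w):
    """-log P(D|w), up to a constant: C times the summed hinge loss."""
    return C * np.maximum(0.0, 1.0 - y * (X @ w)).sum()

# Candidate classifiers drawn from the prior: uniform directions on the unit sphere.
W = rng.normal(size=(5000, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)
log_w = np.array([-neg_log_likelihood(w) for w in W])
weights = np.exp(log_w - log_w.max())             # stable unnormalized posterior weights

def bayes_classify(x):
    """Equation (1): sign of the posterior-weighted vote over all sampled classifiers."""
    return np.sign(np.sum(weights * np.sign(W @ x)))

print(bayes_classify(np.array([1.0, 0.0])))
```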

2.2 Bayes Point

The Bayes point is the classifier that best mimics the Bayes classification strategy. It was shown elsewhere [2, 5] that under certain mild assumptions the average classifier $\mathbf{w}_{cm} = E_{P(\mathbf{w}|D)}[\mathbf{w}]$ converges very quickly to the Bayes point. The Bayes Point Machine is an algorithm that aims to return this center of mass under the posterior $P(\mathbf{w}|D)$. We can think of the center of mass approximating the Bayes classification strategy as follows:

$$ \mathrm{sign}\left( \int_{\mathcal{H}} f(\mathbf{x}) \, P(f|D) \, df \right) = \mathrm{sign}\left( E_{P(\mathbf{w}|D)}\big[\mathrm{sign}(\mathbf{w} \cdot \mathbf{x})\big] \right) \approx \mathrm{sign}\left( E_{P(\mathbf{w}|D)}[\mathbf{w}] \cdot \mathbf{x} \right) \qquad (2) $$

Hence, the Bayes point machine returns the average classifier $\mathbf{w}_{cm}$, which very closely approximates the Bayesian classification strategy. Figure 1 shows the difference between support vector machine classification and the Bayesian classification strategy: the Bayes point machine approximates a vote among all linear separators [2, 4], whereas the support vector machine aims to maximize the margin [1].

Computing the center of mass is a difficult task, and many authors have used sampling methods to recover the mean of the posterior [2, 3]. To compute the center of mass we can write the posterior $P(\mathbf{w}|D)$ as:

$$ P(\mathbf{w}|D) \propto P(D|\mathbf{w}) \cdot P(\mathbf{w}) \qquad (3) $$

Here $P(\mathbf{w})$ represents the prior on the space of possible classifiers, and most authors have restricted it to a uniform prior. Since our classification function is of the form $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, only the direction vector $\mathbf{w}$ characterizes the function. Further, as the magnitude of $\mathbf{w}$ is irrelevant for classification, we can restrict attention to all $\mathbf{w}$ of a fixed length. As there is no reason to prefer any single classifier before looking at the data, the prior is uniform over all $\mathbf{w}$ of a fixed length; this distribution is fair to all the classifiers.

Now, $P(D|\mathbf{w})$ in equation (3) is the likelihood of the training data given $\mathbf{w}$, and one of the possible forms is:

$$ P(D|\mathbf{w}) = \exp\left( -\sum_{i=1}^{n} L_{0\text{-}1}\big(y_i, f(\mathbf{x}_i)\big) \right) \qquad (4) $$

where $L_{0\text{-}1}(y, \hat{y})$ is the hard zero-one loss, defined as:

$$ L_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ \infty & \text{otherwise.} \end{cases} \qquad (5) $$

Figure 2: SVM and BPM in the version space, from the PhD thesis of Tom Minka [4].
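Under the uniform prior on fixed-length $\mathbf{w}$ and the hard 0-1 likelihood of equations (4) and (5), the posterior is uniform over the version space, so the center of mass can be estimated by brute-force rejection sampling. The toy problem below is an assumption made only to keep the sketch self-contained; the papers cited here use billiard and MCMC samplers instead [2, 3].

```python
# Rejection-sampling sketch of the Bayes point under equations (3)-(5):
# keep uniformly drawn unit directions that separate the training data
# (i.e. lie in the version space) and average them (equation (2)).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
y = np.sign(X @ np.array([1.0, 2.0]))             # toy, linearly separable labels

samples, draws = [], 0
while len(samples) < 300 and draws < 200_000:      # guard against tiny version spaces
    draws += 1
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)                         # uniform direction on the unit sphere
    if np.all(y * (X @ w) > 0):                    # w lies inside the version space
        samples.append(w)

assert samples, "no separating direction found; try more draws"
w_bp = np.mean(samples, axis=0)                    # estimated Bayes point (center of mass)
print("Bayes point estimate:", w_bp)
print("training errors:", int(np.sum(np.sign(X @ w_bp) != y)))
```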

The likelihood given in equation (4) is 1 if $f$ perfectly classifies the training data and zero otherwise. Under this likelihood the posterior assigns equal non-zero probability to all classifiers of a fixed length that perfectly separate the data, and the Bayes point machine is expected to return the average of all these classifiers.

We can also give the BPM a graphical interpretation. In the feature space the data points are plotted as points and the classifiers as hyperplanes. Alternatively, we can consider a parameter space, where the classifiers are plotted as points and the data points as hyperplanes. In the parameter space, the set of classifiers that correctly classify all the data points forms a convex set, called the version space, bounded by the hyperplanes corresponding to the data points. Limiting our classifiers to a fixed length $l$ corresponds to looking at a sphere of radius $l$ in the parameter space. Under the uniform prior and the hard 0-1 loss, the BPM returns the center of mass of the version space. Figure 2 (from [4]) shows this interpretation.

3 Support Vector Machines

Classifiers based on the support vector machine (SVM) perform binary classification by first projecting the data points into a high-dimensional feature space and then using a hyperplane that is maximally separated from the nearest positive and negative data points [1]. In this discussion we have restricted ourselves to linear kernels, but everything extends to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected. For a linear kernel without bias (i.e. $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$), the quadratic programming problem for an SVM can be written as follows:

$$ \mathbf{w}_{SVM} = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; C \sum_{i=1}^{n} \xi_i + \|\mathbf{w}\|^2 $$

subject to:

$$ y_i f(\mathbf{x}_i) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \qquad \text{for } i = 1, \ldots, n $$

Here $C$ is a user-specified constant. This can be rewritten as:

$$ \mathbf{w}_{SVM} = \arg\min_{\mathbf{w}} \; C \sum_{i=1}^{n} \big(1 - y_i f(\mathbf{x}_i)\big)_+ + \|\mathbf{w}\|^2 \qquad (6) $$

where:

$$ (z)_+ = \begin{cases} z & \text{if } z \geq 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (7) $$

As mentioned in Herbrich et al. [2], the support vector machine classifier can be thought of as the center of the maximally inscribable ball in the version space; Figure 2 (from [4]) shows this graphical interpretation. Further, Evgeniou et al. [6] have shown that SVM classification is an instance of the more general Tikhonov regularization. Tikhonov regularization is a general approach to learning in which the aim is to find a function that minimizes the training error while simultaneously keeping its norm in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ small. Tikhonov regularization can be written as:

$$ f_{reg} = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(\mathbf{x}_i)\big) + \lambda \|f\|_K^2 \qquad (8) $$

Here $L(y, f(\mathbf{x}))$ is the loss function as defined earlier, $\lambda$ is the user-specified regularization parameter, and $\|f\|_K$ is the norm in the RKHS defined by the positive definite kernel function $K$.
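The unconstrained form (6) can be made concrete with a few lines of code. The sketch below minimizes it by plain subgradient descent; this is only an illustration under assumed toy data, learning rate and iteration count, not how SVMs are solved in practice (a QP or a dedicated solver would be used).

```python
# Sketch: minimize the SVM objective of equation (6),
#   C * sum_i (1 - y_i w.x_i)_+  +  ||w||^2,
# by plain subgradient descent (for illustration only).
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.005, n_iter=5000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        active = y * (X @ w) < 1.0                 # points with non-zero hinge loss
        # Subgradient of C * sum hinge + ||w||^2 with respect to w.
        grad = 2.0 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = np.sign(X @ np.array([2.0, -1.0]))             # toy, linearly separable labels
w_svm = train_linear_svm(X, y)
hinge = np.maximum(0.0, 1.0 - y * (X @ w_svm)).sum()
print("objective value:", 1.0 * hinge + w_svm @ w_svm)
```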

The first term in the Tikhonov regularization denotes the empirical error and the second term denotes the complexity of the solution: the functional represents the trade-off between choosing functions that are simple and choosing functions that best represent the training data. The regularization parameter $\lambda$ adjusts the preference for simple functions over functions that best fit the data. By changing the form of the loss function $L(y, f(\mathbf{x}))$ and the kernel $K$, a number of popular classification and regression schemes can be derived. Evgeniou et al. [6] have discussed standard regularization networks, SVM classification and SVM regression in detail as different cases of Tikhonov regularization that arise from different choices of $L$ and $K$. For example, algorithms for standard regularization networks can be derived by using a squared loss function, i.e. $L(y_i, f(\mathbf{x}_i)) = (y_i - f(\mathbf{x}_i))^2$. Further, since in this discussion we have restricted ourselves to linear kernels, $\|f\|_K^2 = \|\mathbf{w}\|^2$. For details please refer to [2, 6].

3.1 Maximum A Posteriori Interpretation of SVM Classification

The solution to the optimization problem for SVM classification (and, more generally, Tikhonov regularization) can be interpreted as the mode of a posterior probability distribution. Consider the interpretation of the empirical loss $\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))$ and the stabilizer $\|f\|_K^2$ as:

$$ P(f) \propto e^{-\|f\|_K^2} \qquad (9) $$

$$ P(D|f) \propto e^{-\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))} \qquad (10) $$

$P(f)$ denotes the prior probability; under this interpretation, functions with a small norm are more likely than functions with a larger norm. $P(D|f)$ denotes the likelihood of the data and is often referred to as the noise model. For the standard regularization network the noise model is Gaussian, and for SVM regression it is a mixture of Gaussians; readers are referred to Girosi et al. [7] and Pontil et al. [8] for more details. It is therefore clear that the solution of the quadratic programming problem corresponding to the SVM can be interpreted as the mode of the probability distribution:

$$ P(f|D) \propto P(D|f) \cdot P(f) \qquad (11) $$

$$ \propto e^{-\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))} \cdot e^{-\|f\|_K^2} \qquad (12) $$

$$ = e^{-\left( \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) + \|\mathbf{w}\|^2 \right)} \qquad (13) $$

With the hinge loss $L(y_i, f(\mathbf{x}_i)) = C\,(1 - y_i f(\mathbf{x}_i))_+$, maximizing this posterior is exactly the minimization in equation (6).
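A small numerical check of equations (9) through (13), under assumed toy data and an assumed value of $C$: with the prior $e^{-\|\mathbf{w}\|^2}$ and the hinge likelihood $e^{-C \sum_i (1 - y_i \mathbf{w} \cdot \mathbf{x}_i)_+}$, the negative log of the unnormalized posterior coincides with the SVM objective of equation (6) for every $\mathbf{w}$, so the posterior mode is the SVM solution.

```python
# Numerical check of the MAP interpretation (equations (9)-(13)): the negative log
# of the unnormalized posterior equals the SVM objective of equation (6) for any w.
# Toy data and the constant C are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.sign(X @ np.array([1.0, 1.0]))
C = 2.0

def hinge_sum(w):
    return np.maximum(0.0, 1.0 - y * (X @ w)).sum()

def svm_objective(w):                   # equation (6): C * sum hinge + ||w||^2
    return C * hinge_sum(w) + w @ w

def neg_log_posterior(w):               # -log of the unnormalized posterior (13)
    neg_log_prior = w @ w               # from P(w) proportional to exp(-||w||^2)
    neg_log_lik = C * hinge_sum(w)      # from P(D|w) proportional to exp(-C * sum hinge)
    return neg_log_prior + neg_log_lik

for _ in range(3):                      # the two quantities agree for arbitrary w
    w = rng.normal(size=2)
    print(round(svm_objective(w), 6), round(neg_log_posterior(w), 6))
```

The normalization constant of the posterior does not depend on $\mathbf{w}$, which is why it can be ignored when locating the mode.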

4 BPMs and SVMs

As shown earlier, the Bayes Point Machine returns the mean of the posterior $P(f|D)$, while the last section showed that the support vector machine solution is the mode of a posterior. The posteriors $P(f|D)$ for the SVM and the BPM are not the same, and in this section we analyze the differences as well as the similarities.

4.1 The Priors

Many authors working with the BPM restrict themselves to classifiers $\mathbf{w}$ of a fixed length and assign a uniform prior over all directions. As mentioned in [4], this can also be achieved using a zero-mean spherical Gaussian distribution as the prior, that is,

$$ P(\mathbf{w}) \propto e^{-\|\mathbf{w}\|^2} \qquad (14) $$

This prior assigns a uniform distribution to all classifiers $\mathbf{w}$ lying on a sphere of a fixed radius. The beauty of this prior is that it allows us to drop the restriction that all classifiers have the same length. Further, this is exactly the prior used to compute the posterior in the SVM case.

4.2 The Likelihood

Much of the BPM literature focuses on the 0-1 loss, with the likelihood $P(D|\mathbf{w})$ given in equation (4). The SVM, on the other hand, has the likelihood $P(D|\mathbf{w}) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$, corresponding to the hinge loss $L(y, f(\mathbf{x})) = (1 - y f(\mathbf{x}))_+$. Using a hard 0-1 loss in the BPM corresponds to focusing only on the region that perfectly classifies the training data. We can refine this hard 0-1 loss to admit the possibility of error by using linear slack, taking the following as our loss function:

$$ L\big(y_i, f(\mathbf{x}_i)\big) = C \big(1 - y_i f(\mathbf{x}_i)\big)_+ \qquad (15) $$

Given this loss function, the likelihood for the BPM becomes:

$$ P(D|\mathbf{w}) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+} \qquad (16) $$

Here $C$ is a constant that determines how hard the boundaries are; we recover the likelihood of equation (4) as $C$ tends to infinity.
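Before stating the main result, here is a rough sketch of the mean-versus-mode distinction: under the common prior $e^{-\|\mathbf{w}\|^2}$ and the soft likelihood of equation (16), a BPM-style answer is the posterior mean and an SVM-style answer is the posterior mode. The random-walk Metropolis sampler, the subgradient step sizes and the toy data are assumptions made purely for illustration; they are not the algorithms used in the BPM or SVM literature.

```python
# Sketch: posterior mean (BPM-style) vs. posterior mode (SVM-style) under the
# posterior proportional to exp(-(C * sum hinge + ||w||^2)).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = np.sign(X @ np.array([1.0, -2.0]))             # toy, linearly separable labels
C = 2.0

def log_post(w):
    """log of the unnormalized posterior built from equations (14) and (16)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()
    return -(C * hinge + w @ w)

# Posterior mean via random-walk Metropolis (illustrative BPM-style estimate).
w, samples = np.zeros(2), []
for t in range(20000):
    proposal = w + 0.3 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(w):
        w = proposal
    if t >= 2000:                                  # discard burn-in
        samples.append(w)
w_mean = np.mean(samples, axis=0)

# Posterior mode via subgradient descent on the negative log posterior (SVM-style).
w_mode = np.zeros(2)
for _ in range(5000):
    active = y * (X @ w_mode) < 1.0
    grad = 2.0 * w_mode - C * (y[active][:, None] * X[active]).sum(axis=0)
    w_mode -= 0.005 * grad

print("posterior mean (BPM-style):", w_mean)
print("posterior mode (SVM-style):", w_mode)
```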

Table 1: Bayes Point Machines and Support Vector Machines, given training data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.

Classification Type | Cost function $L(y, f(\mathbf{x}))$ | Computation criterion
Bayes Point Machine | any reasonable cost function, e.g. the 0-1 loss $\mathbb{1}[y \neq f(\mathbf{x})]$ | mean of $P(f|D)$, i.e. $E_{P(f|D)}[f]$
Tikhonov Regularization | any reasonable cost function | mode of $P(f|D)$
SVM Classification | hinge loss $(1 - y f(\mathbf{x}))_+$ | mode of $P(f|D)$, i.e. $\arg\max_f P(f|D)$
Regularization Network | squared loss $(y - f(\mathbf{x}))^2$ | mode of $P(f|D)$, i.e. $\arg\max_f P(f|D)$

4.3 Main Result

The main result is summarized in Table 1, and we can state the following:

- The solution obtained using the BPM is the mean of the posterior $P(\mathbf{w}|D)$. One of the possible likelihoods for the BPM is $P(D|\mathbf{w}) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$.
- The solution obtained by the SVM, on the other hand, is the mode of the posterior, with the likelihood function given exactly by $P(D|\mathbf{w}) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$.
- The prior $P(\mathbf{w}) \propto e^{-\|\mathbf{w}\|^2}$ is the same for both the BPM and the SVM.

5 Conclusion and Future Work

We have posed the SVM and the BPM in a single probabilistic framework. Under this framework the BPM finds the mean of the posterior distribution over functions, whereas the SVM finds the MAP estimate when using the hinge loss function. The question of which is better therefore boils down to the choice of mean versus mode and the choice of loss function. The BPM has the advantage that it usually works with the 0-1 loss, which is a natural choice of loss function, and it has been proved elsewhere that the BPM converges to the Bayes point, the projection of the Bayesian classification strategy onto the hypothesis space. The SVM, on the other hand, has been shown to work very well in many applications and is much easier to compute than the BPM.

The open questions include the performance of a classifier that is the mean of the posterior when using a hinge loss. Further, the questions of stability, consistency and convergence of the BPM have not yet been answered. Also, since the mean of the posterior converges to the Bayes point under mild assumptions, an interesting question is how the mode relates to the Bayes point.

Acknowledgments

Thanks to Tom Minka for the Bayes Point Machine code and to Sayan Mukherjee, Yuan Qi and Rosalind W. Picard for insightful discussions.

References

[1] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[2] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245-279, 2001.

[3] P. Rujan. Playing billiards in version space. Neural Computation, 9:99-122, 1997.

[4] Thomas P. Minka. A Family of Algorithms for Approximate Bayesian Inference, Chapter 5. PhD thesis, Massachusetts Institute of Technology, 2001.

[5] T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21, 1993.

[6] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.

[7] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7:219-269, 1995.

[8] M. Pontil, S. Mukherjee, and F. Girosi. On the noise model of support vector machine regression. A.I. Memo 1651, Massachusetts Institute of Technology, A.I. Lab, October 1998.
