Knowledge-Based Nonlinear Kernel Classifiers


Glenn M. Fung, Olvi L. Mangasarian, and Jude W. Shavlik
Computer Sciences Department, University of Wisconsin, Madison, WI 53706
{gfung,olvi,shavlik}@cs.wisc.edu
(O. L. Mangasarian is also with the Department of Mathematics, University of California at San Diego, La Jolla, CA 92093.)

Abstract. Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a nonlinear kernel support vector machine (SVM) classifier. The resulting formulation leads to a linear program that can be solved efficiently. This extends, in a rather unobvious fashion, previous work [3] that incorporated similar prior knowledge into a linear SVM classifier. Numerical tests on standard-type test problems, such as exclusive-or prior knowledge sets and a checkerboard with 16 points and prior knowledge instead of the usual 1000 points, show the effectiveness of the proposed approach in generating sharp nonlinear classifiers based mostly or totally on prior knowledge.

Keywords: prior knowledge, support vector machines, linear programming

1 Introduction

Support vector machines (SVMs) have played a major role in classification problems [15, 2, 10]. However, unlike other classification tools such as knowledge-based neural networks [13, 4, 14], little work [11, 3] has gone into incorporating prior knowledge into support vector machines. In this work we extend the previous work [3] of incorporating multiple polyhedral sets as prior knowledge for a linear classifier to nonlinear kernel-based classifiers. This extension is not an obvious one, since it depends critically on the theory of linear inequalities and cannot be incorporated directly into a nonlinear kernel classifier. However, if the kernel trick is employed after one uses a theorem of the alternative for linear inequalities [9, Chapter 2], then incorporation of polyhedral knowledge sets into a nonlinear kernel classifier can be achieved. We show this in Section 2 of the paper. In Section 3 we derive a linear programming formulation that generates a nonlinear kernel SVM classifier that is based on conventional data as well as two groups of polyhedral sets, each group of which belongs to one of two classes. We note that conventional datasets are not essential to our formulation and can be surrogated by samples taken from the knowledge sets. In Section 4 we test our formulation on standard-type test problems. The first test problem is the exclusive-or (XOR) problem consisting of four points in 2-dimensional input space plus four

polyhedral knowledge sets, all of which get classified perfectly by a Gaussian kernel knowledge-based classifier. The second test problem is the checkerboard problem consisting of 16 two-colored squares. Typically this problem is classified based on 1000 points. Here, by using only 16 points plus prior knowledge, our knowledge-based nonlinear kernel classifier generates a sharp classifier that is as good as that obtained by using 1000 points. Section 5 concludes the paper.

We now describe our notation. All vectors will be column vectors unless transposed to a row vector by a prime ′. The scalar (inner) product of two vectors x and y in the n-dimensional real space R^n will be denoted by x′y. For x ∈ R^n, ‖x‖_p denotes the p-norm, p = 1, 2, ∞. The notation A ∈ R^(m×n) will signify a real m × n matrix. For such a matrix, A′ will denote the transpose of A and A_i will denote the i-th row of A. A vector of ones in a real space of arbitrary dimension will be denoted by e. Thus for e ∈ R^m and y ∈ R^m the notation e′y will denote the sum of the components of y. A vector of zeros in a real space of arbitrary dimension will be denoted by 0. The identity matrix of arbitrary dimension will be denoted by I. A separating plane, with respect to two given point sets A and B in R^n, is a plane that attempts to separate R^n into two halfspaces such that each open halfspace contains points mostly of A or B. A bounding plane to the set A is a plane that places A in one of the two closed halfspaces that the plane generates. The abbreviation s.t. stands for "such that". For A ∈ R^(m×n) and B ∈ R^(n×k), a kernel K(A, B) maps R^(m×n) × R^(n×k) into R^(m×k). In particular, if x and y are column vectors in R^n, then K(x′, y) is a real number, K(x′, A′) is a row vector in R^m and K(A, A′) is an m × m matrix. We shall make no assumptions on our kernels other than symmetry, that is K(x′, y) = K(y′, x), and in particular we shall not assume or make use of Mercer's positive definiteness condition [15, 2]. The base of the natural logarithm will be denoted by ε. A frequently used kernel in nonlinear classification is the Gaussian kernel [15, 2, 10] whose ij-th element, i = 1, ..., m, j = 1, ..., k, is given by:

(K(A, B))_ij = ε^(−μ‖A_i′ − B_{·j}‖²),

where A ∈ R^(m×n), B ∈ R^(n×k), B_{·j} is the j-th column of B, and μ is a positive constant.
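As a concrete illustration (not part of the original paper), the Gaussian kernel matrix defined above can be computed directly from its formula. The sketch below assumes NumPy; the names A, B, mu are our own, and B is passed with points as rows, i.e. it plays the role of B′ in the paper's notation.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Gaussian kernel with (i, j) entry exp(-mu * ||A_i - B_j||^2).

    A  : (m, n) array whose rows are points.
    B  : (k, n) array whose rows are points (the paper's B', so the result
         corresponds to K(A, B') and is m x k).
    mu : positive kernel width parameter.
    """
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq_dists, 0.0))

# Example: a 4-point XOR-like dataset in R^2; K(A, A') is 4 x 4 and symmetric.
A = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
K = gaussian_kernel(A, A, mu=1.0)
```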

2 Prior Knowledge in a Nonlinear Kernel Classifier

We begin with a brief description of support vector machines (SVMs). SVMs are used principally for classification [10, 15, 2]. The simplest classifier is a linear separating surface, a plane in R^n:

x′w = γ, (1)

where w ∈ R^n determines the orientation of the plane (1), in fact it is the normal to the plane, and γ determines the location of the plane relative to the origin. The separating plane (1) lies midway between two parallel bounding planes:

x′w = γ + 1, x′w = γ − 1, (2)

each of which attempts to place each class of points in one of the two halfspaces:

{x | x′w ≥ γ + 1}, {x | x′w ≤ γ − 1}. (3)

In addition, these bounding planes are pushed apart as far as possible. To obtain a more complex classifier, one resorts to a nonlinear separating surface in R^n instead of the linear separating surface (1), defined as follows [10]:

K(x′, A′)Du = γ, (4)

where K is an arbitrary nonlinear kernel as defined in the Introduction, and (u, γ) are determined by solving the linear program (19). Here, A ∈ R^(m×n) represents a set of m points in R^n, each of which belongs to class A+ or A− depending on whether the corresponding element of a given m × m diagonal matrix D is +1 or −1, and u ∈ R^m is a dual variable. The linear separating surface (1) becomes a special case of the nonlinear surface (4) if we use the linear kernel K(A, A′) = AA′ and set w = A′Du [10, 8].
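To make the role of (u, γ) concrete, here is a small sketch (ours, not from the paper) that evaluates the nonlinear surface (4) and classifies points by the sign of K(x′, A′)Du − γ. It assumes the gaussian_kernel helper sketched above and hypothetical arrays A, D, u, gamma.

```python
import numpy as np

def classify(X, A, D, u, gamma, mu):
    """Label the rows of X using the nonlinear surface K(x', A')Du = gamma of eq. (4).

    X     : (N, n) points to classify.
    A     : (m, n) kernel/training points.
    D     : (m, m) diagonal matrix of +1/-1 labels.
    u     : (m,) dual variable, gamma : scalar, mu : Gaussian kernel parameter.
    Returns +1 for class A+ and -1 for class A-.
    """
    Kx = gaussian_kernel(X, A, mu)     # each row is K(x', A') for one point x
    scores = Kx @ (D @ u) - gamma      # K(x', A')Du - gamma
    return np.where(scores >= 0.0, 1, -1)
```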

We turn now to the incorporation of prior knowledge in the form of a polyhedral set into a nonlinear kernel classifier. But first, we show that a routine incorporation of such knowledge leads to a nonlinear system of nonconvex inequalities that is not very useful. Suppose that the polyhedral set {x | Bx ≤ b}, where B ∈ R^(l×n) and b ∈ R^l, must lie in the halfspace {x | x′w ≥ γ + 1} for some given w ∈ R^n and γ ∈ R. We thus have the implication:

Bx ≤ b ⟹ x′w ≥ γ + 1. (5)

By letting w take on its dual representation w = A′Du [10, 8], the implication (5) becomes:

Bx ≤ b ⟹ x′A′Du ≥ γ + 1. (6)

If we now kernelize this implication by letting x′A′ → K(x′, A′), where K is some nonlinear kernel as defined in the Introduction, we then have the implication, for a given A, D, u and γ, that:

Bx ≤ b ⟹ K(x′, A′)Du ≥ γ + 1. (7)

This is equivalent to the following nonlinear, and generally nonconvex, system of inequalities not having a solution x for a given A, D, u and γ:

Bx ≤ b, K(x′, A′)Du < γ + 1. (8)

Unfortunately, the nonlinearity and nonconvexity of the system (8) preclude the use of any theorem of the alternative for either linear or convex inequalities [9]. We thus have to backtrack to the implication (6) and rewrite it equivalently as the following system of homogeneous linear inequalities not having a solution (x, ζ) ∈ R^(n+1) for a given fixed u and γ:

Bx − bζ ≤ 0, u′DAx − (γ + 1)ζ < 0, −ζ < 0. (9)

Here, the positive variable ζ is introduced in order to make the inequalities (9) homogeneous in (x, ζ), thus enabling us to use a desired theorem of the alternative [9] for such linear inequalities. It follows by Motzkin's Theorem of the Alternative [9] that (9) is equivalent to the following system of linear inequalities having a solution (v, η, τ) ∈ R^(l+1+1) for a given fixed u and γ:

B′v + (A′Du)η = 0,
−b′v − (γ + 1)η − τ = 0,
v ≥ 0, 0 ≠ (η, τ) ≥ 0. (10)

Here, the last constraint signifies that at most one of the two nonnegative variables η and τ can be zero. Hence, if η = 0, then τ > 0. It follows then from (10) that there exists a v such that:

B′v = 0, −b′v > 0, v ≥ 0,

which contradicts the natural assumption that the knowledge set {x | Bx ≤ b} is nonempty. Otherwise, for x in this knowledge set, we have the contradiction:

0 = v′Bx ≤ b′v < 0. (11)

Hence η > 0 and τ ≥ 0. Dividing the inequalities of (10) by η and redefining v as v/η, we have from (10) that the following system of linear inequalities has a solution v for a given u and γ:

B′v + A′Du = 0, b′v + γ + 1 ≤ 0, v ≥ 0. (12)

Under the rather natural assumption that A has linearly independent columns, this in turn is equivalent to the following system of linear inequalities having a solution v for a given u and γ:

AB′v + AA′Du = 0, b′v + γ + 1 ≤ 0, v ≥ 0. (13)

Note that the linear independence is needed only for (13) to imply (12). Replacing the linear kernels AB′ and AA′ by the general nonlinear kernels K(A, B′) and K(A, A′), we obtain that the following system of linear inequalities has a solution v for a given u and γ:

K(A, B′)v + K(A, A′)Du = 0, b′v + γ + 1 ≤ 0, v ≥ 0. (14)

This is the set of constraints that we shall impose on our nonlinear classification formulation as a surrogate for the implication (7).
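Whether a given (u, γ) honors a particular knowledge set can therefore be checked numerically by testing the linear system (14) for feasibility. The following sketch (ours, not the authors' code) does this with scipy.optimize.linprog as a pure feasibility problem; it assumes the gaussian_kernel helper above, and the names A, D, u, gamma, B, b, mu are our own.

```python
import numpy as np
from scipy.optimize import linprog

def knowledge_set_respected(A, D, u, gamma, B, b, mu):
    """Feasibility test for system (14): is there v >= 0 with
    K(A, B')v + K(A, A')Du = 0 and b'v + gamma + 1 <= 0?
    Feasibility certifies implication (16) for the knowledge set {x : Bx <= b}.
    B is the (l, n) matrix of the inequalities Bx <= b, passed row-wise, so
    gaussian_kernel(A, B, mu) plays the role of K(A, B').
    """
    K_AB = gaussian_kernel(A, B, mu)          # m x l
    K_AA = gaussian_kernel(A, A, mu)          # m x m
    l = K_AB.shape[1]
    rhs_eq = -K_AA @ (D @ u)                  # K(A, B')v = -K(A, A')Du
    res = linprog(c=np.zeros(l),              # zero objective: feasibility only
                  A_ub=b.reshape(1, -1), b_ub=np.array([-(gamma + 1.0)]),
                  A_eq=K_AB, b_eq=rhs_eq,
                  bounds=[(0, None)] * l, method="highs")
    return res.status == 0                    # status 0: a feasible v was found
```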

Since the derivation of these conditions was not obtained directly from (7), it is useful to state precisely what the conditions (14) are equivalent to. By using a similar reasoning that employs theorems of the alternative as we did above, we can derive the following equivalence result, which we state without giving its explicit proof; the proof is very similar to the arguments used above.

Proposition 2.1 (Knowledge Set Classification). Let

{y | K(B, A′)y ≤ b} ≠ ∅. (15)

Then the system (14) having a solution v, for a given u and γ, is equivalent to the implication:

K(B, A′)y ≤ b ⟹ u′DK(A, A′)y ≥ γ + 1. (16)

We note that the implication (16) is not precisely the implication (7) that we started with, but can be thought of as a kernelized version of it. To see this, we state a corollary to the above proposition which shows what the implication means for a linear kernel AA′.

Corollary 2.2 (Linear Knowledge Set Classification). Let

{y | Bx ≤ b, x = A′y} ≠ ∅. (17)

For a linear kernel K(A, A′) = AA′, the system (14) having a solution v, for a given u and γ, is equivalent to the implication:

Bx ≤ b, x = A′y ⟹ x′w ≥ γ + 1, where w = A′Du. (18)

We immediately note that the implication (18) is equivalent to the desired implication (5) for linear knowledge sets, under the rather unrestrictive assumption that A has linearly independent columns. That the columns of A are linearly independent is equivalent to assuming that, in the input space, features are not linearly dependent on each other. If they were, then linearly dependent features could easily be removed from the problem.

We turn now to a linear programming formulation of a nonlinear kernel classifier that incorporates prior knowledge in the form of multiple polyhedral sets.

3 Knowledge-Based Linear Programming Formulation of Nonlinear Kernel Classifiers

A standard [1, 10] linear programming formulation of a nonlinear kernel classifier is given by:

min_{u,γ,r,y}  νe′y + e′r
s.t.  D(K(A, A′)Du − eγ) + y ≥ e,
      −r ≤ u ≤ r,
      y ≥ 0. (19)

The (u, γ) taken from a solution (u, γ, r, y) of (19) generates the nonlinear separating surface (4).
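As one way to solve (19) numerically, the sketch below (ours, not part of the paper) casts it as a standard-form LP for scipy.optimize.linprog, stacking the decision variables as [u, γ, r, y]. It assumes the gaussian_kernel helper above; the names d, nu, mu are our own.

```python
import numpy as np
from scipy.optimize import linprog

def train_lp_svm(A, d, nu, mu):
    """Solve the linear program (19) for a nonlinear kernel classifier.

    A  : (m, n) training points, d : (m,) labels in {+1, -1},
    nu : weight on the empirical error e'y, mu : Gaussian kernel parameter.
    Returns (u, gamma) defining the surface K(x', A')Du = gamma of eq. (4).
    """
    m = A.shape[0]
    D = np.diag(d.astype(float))
    K = gaussian_kernel(A, A, mu)
    # Decision vector z = [u (m), gamma (1), r (m), y (m)].
    c = np.concatenate([np.zeros(m), [0.0], np.ones(m), nu * np.ones(m)])
    I, Z = np.eye(m), np.zeros((m, m))
    # D(K D u - e gamma) + y >= e   <=>   -D K D u + D e gamma - y <= -e
    row1 = np.hstack([-D @ K @ D, D @ np.ones((m, 1)), Z, -I])
    #  u - r <= 0  and  -u - r <= 0  encode  -r <= u <= r
    row2 = np.hstack([ I, np.zeros((m, 1)), -I, Z])
    row3 = np.hstack([-I, np.zeros((m, 1)), -I, Z])
    A_ub = np.vstack([row1, row2, row3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)  # u, gamma free; r, y >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    u, gamma = res.x[:m], res.x[m]
    return u, gamma
```

The tuple (u, gamma) returned here can be fed directly to the classify sketch given after equation (4).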

Suppose now that we are given the following knowledge sets:

p sets belonging to A+: {x | B^i x ≤ b^i}, i = 1, ..., p,
q sets belonging to A−: {x | C^i x ≤ c^i}, i = 1, ..., q. (20)

It follows from the implication (7) for B = B^i and b = b^i, i = 1, ..., p, and its consequence, the existence of a solution to (14), and a similar implication for the sets {x | C^i x ≤ c^i}, i = 1, ..., q, that the following holds: There exist s^i, i = 1, ..., p, and t^j, j = 1, ..., q, such that:

K(A, B^i′)s^i + K(A, A′)Du = 0, b^i′s^i + γ + 1 ≤ 0, s^i ≥ 0, i = 1, ..., p,
K(A, C^j′)t^j − K(A, A′)Du = 0, c^j′t^j − γ + 1 ≤ 0, t^j ≥ 0, j = 1, ..., q. (21)

We now incorporate the knowledge sets (20) into the nonlinear kernel classifier linear program (19) by adding the conditions (21) as constraints to (19) as follows:

min_{u,γ,r,(y,s^i,t^j)≥0}  νe′y + e′r
s.t.  D(K(A, A′)Du − eγ) + y ≥ e,
      −r ≤ u ≤ r,
      K(A, B^i′)s^i + K(A, A′)Du = 0, b^i′s^i + γ + 1 ≤ 0, i = 1, ..., p,
      K(A, C^j′)t^j − K(A, A′)Du = 0, c^j′t^j − γ + 1 ≤ 0, j = 1, ..., q. (22)

This linear programming formulation incorporates the knowledge sets (20) into the appropriate halfspaces in the higher dimensional feature space. However, since there is no guarantee that we are able to place each knowledge set in the appropriate halfspace, we need to introduce error variables z_1^i, ζ_1^i, i = 1, ..., p, and z_2^j, ζ_2^j, j = 1, ..., q, just like the error variable y of the SVM formulation (19), and attempt to drive these error variables to zero by modifying our last formulation above as follows:

min_{u,γ,r,z_1^i,z_2^j,(y,s^i,t^j,ζ_1^i,ζ_2^j)≥0}  νe′y + e′r + μ( Σ_{i=1,...,p} (e′z_1^i + ζ_1^i) + Σ_{j=1,...,q} (e′z_2^j + ζ_2^j) )
s.t.  D(K(A, A′)Du − eγ) + y ≥ e,
      −r ≤ u ≤ r,
      −z_1^i ≤ K(A, B^i′)s^i + K(A, A′)Du ≤ z_1^i, b^i′s^i + γ + 1 ≤ ζ_1^i, i = 1, ..., p,
      −z_2^j ≤ K(A, C^j′)t^j − K(A, A′)Du ≤ z_2^j, c^j′t^j − γ + 1 ≤ ζ_2^j, j = 1, ..., q. (23)

This is our final knowledge-based linear programming formulation, which incorporates the knowledge sets (20) into the linear classifier with weight μ, while the (empirical) error term e′y is given weight ν. As usual, the values of these two parameters, ν and μ, are chosen by means of a tuning set extracted from the training set.
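To give a flavor of how the knowledge constraints enter the linear program, the sketch below (an illustration of ours, not the authors' code) assembles the slacked block of (23) for a single class-A+ knowledge set {x : Bx ≤ b}, written over a local variable layout [u, γ, s, z1, ζ1]. Splicing these columns into the full decision vector of (23), which also carries r, y and the blocks of the other knowledge sets, and adding e′z1 + ζ1 to the objective with weight μ, is routine bookkeeping; the C^j blocks are built the same way with −K(A, A′)Du and c′t − γ + 1 ≤ ζ2.

```python
import numpy as np

def plus_knowledge_block(K_AB, K_AA, D, b):
    """Inequality rows for the slacked A+ knowledge constraints of (23),
        -z1 <= K(A, B')s + K(A, A')Du <= z1,   b's + gamma + 1 <= zeta1,
    over the local variable layout [u (m), gamma (1), s (l), z1 (m), zeta1 (1)].
    K_AB = gaussian_kernel(A, B, mu) plays the role of K(A, B'); K_AA = K(A, A').
    Bounds s >= 0, z1 >= 0, zeta1 >= 0 are supplied separately.
    """
    m, l = K_AB.shape
    KD = K_AA @ D                             # coefficient of u
    I1 = np.eye(m)
    zc = np.zeros((m, 1))                     # zero column for gamma / zeta1
    #  K(A,B')s + K(A,A')Du - z1 <= 0
    upper = np.hstack([ KD, zc,  K_AB, -I1, zc])
    # -K(A,B')s - K(A,A')Du - z1 <= 0
    lower = np.hstack([-KD, zc, -K_AB, -I1, zc])
    #  b's + gamma - zeta1 <= -1
    last = np.hstack([np.zeros(m), [1.0], b, np.zeros(m), [-1.0]]).reshape(1, -1)
    A_ub = np.vstack([upper, lower, last])
    b_ub = np.concatenate([np.zeros(2 * m), [-1.0]])
    return A_ub, b_ub
```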

Remark 3.1 (Data-Based and Knowledge-Based Classifiers). If we set μ = 0, then the linear program (23) degenerates to (19), the linear program associated with an ordinary data-based nonlinear kernel SVM. We can also make the linear program (23), which generates a nonlinear classifier, be purely knowledge-based and not dependent on any specific training data if we replace the matrix A appearing everywhere in (23) by a random sample of points taken from the knowledge sets (20), together with the associated diagonal matrix D. This might be a useful paradigm for situations where training datasets are not easily available but expert knowledge, such as doctors' experience in diagnosing certain diseases, is readily available. In fact, using this idea of making A and D random samples drawn from the knowledge sets (20), the linear programming formulation (23) as is can be made totally dependent on prior knowledge only.

We turn now to our numerical experiments.

4 Numerical Experience

The focus of this paper is rather theoretical. However, in order to illustrate the power of the proposed formulation, we tested our algorithm on two synthetic examples for which most or all of the data consists of knowledge sets. Experiments involving real-world knowledge sets will be carried out in future work.

Exclusive-Or (XOR) Knowledge Sets. This example generalizes the well-known XOR example, which consists of the four vertices of a rectangle in 2 dimensions, with the pair of vertices at the ends of one diagonal belonging to one class (crosses) while the other pair belongs to the other class (stars). Figure 1 depicts two such pairs of vertices symmetrically placed around the origin. It also depicts two pairs of knowledge sets, with each pair belonging to one class (two triangles and two parallelograms, respectively). The given points in this XOR example can be considered in two different ways. We note that, in line with Remark 3.1, this classifier can be considered either partially or totally dependent on prior knowledge, depending on whether the four data points are given independently of the knowledge sets or as points contained in them. Our knowledge-based linear programming classifier (23) with a Gaussian kernel yielded the depicted nonlinear separating surface, which classified all given points and sets correctly.

Another realization of the XOR example is depicted in Figure 2. Here the data points are not positioned symmetrically with respect to the origin and only one of them is contained in a knowledge set. The resulting nonlinear separating surface for this XOR example is constrained by one of the knowledge sets; in fact it is tangent to one of the diamond-shaped knowledge sets.

Fig. 1. Totally or partially knowledge-based XOR classification problem. The nonlinear classifier obtained by using a Gaussian kernel in our linear programming formulation (23) completely separates the two pairs of prior knowledge sets as well as the two pairs of points. The points can be treated as samples taken from the knowledge sets or given independently of them.

Fig. 2. Another XOR classification problem, where only one of the points is contained in a knowledge set. The nonlinear classifier obtained by using a Gaussian kernel in our linear programming formulation (23) completely separates the two pairs of prior knowledge sets as well as the two pairs of points. Note the strong influence of the knowledge sets on the separating surface, which is tangent to one of the knowledge sets.

Checkerboard. Our second example is the classical checkerboard dataset [5, 6, 8, 7], which consists of points taken from a 16-square checkerboard. The following experiment on this dataset shows the strong influence of knowledge sets on the separating surface. We first took a subset of 16 points only, each one the center of one of the 16 squares. Since we are using a nonlinear Gaussian kernel to model the separating surface, this particular choice of the training set is very appropriate for the checkerboard dataset. However, due to the nature of the Gaussian function, it is hard to learn the sharpness of the checkerboard by using only a 16-point Gaussian kernel basis. We thus obtain a fairly poor Gaussian-based representation of the checkerboard, depicted in Figure 3, with correctness of only 89.66% on a testing set of 39,601 uniformly randomly generated points labeled according to the true checkerboard pattern.

Fig. 3. A poor nonlinear classifier based on 16 points, one taken from each of the 16 squares of a checkerboard. The nonlinear classifier was obtained by using a Gaussian kernel in the conventional linear programming formulation (19).

On the other hand, if we use these same 16 points in the linear program (23) as a distinct dataset, in conjunction with prior knowledge in the form of 8 linear inequalities characterizing only two subsquares fully contained in the leftmost two squares of the bottom row of squares, we obtain the very sharply defined checkerboard depicted in Figure 4, with a correctness of 98.5% on the 39,601-point testing set. We note that it does not matter which two subsquares are taken as prior knowledge, as long as there is one from each class. The size of the subsquares is not critical either. Figure 4 was obtained with subsquares with sides equal to 0.75 times the original sides and centered around the given data points. Subsquares down to 0.25 times the size of the original squares gave similar results. We emphasize here that the prior knowledge given to the linear program (23) is truly partial knowledge, in the sense that it gives the linear program information on only 2 of the 16 squares.

Fig. 4. Knowledge-based checkerboard classification problem. The nonlinear classifier was obtained by using a Gaussian kernel in our linear programming formulation (23), with the 16 depicted points as a given dataset together with prior knowledge consisting of 8 linear inequalities characterizing only two subsquares contained in the leftmost two squares of the bottom row of squares. The sharply defined checkerboard has a correctness of 98.5% on a 39,601-point testing set.
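For concreteness, here is a sketch (our illustration, not the authors' setup) of how the 16 center points and the two knowledge subsquares described above can be written down as a dataset plus polyhedral sets B^i x ≤ b^i. It assumes the board is normalized to [−1, 1]², as the axes of Figures 3 and 4 suggest, and the particular label parity chosen for the coloring is our own assumption; the 0.75 scaling and the choice of the two bottom-left squares follow the text.

```python
import numpy as np

# 16 centers of the 4x4 checkerboard on [-1, 1]^2, with alternating labels.
centers = np.array([[x, y] for y in (-0.75, -0.25, 0.25, 0.75)
                           for x in (-0.75, -0.25, 0.25, 0.75)])
labels = np.array([1 if (i + j) % 2 == 0 else -1      # assumed coloring
                   for j in range(4) for i in range(4)])

def box_as_inequalities(center, half_side):
    """Axis-aligned square {x : Bx <= b} centered at `center`."""
    B = np.array([[ 1.0,  0.0], [-1.0,  0.0], [ 0.0,  1.0], [ 0.0, -1.0]])
    b = np.array([center[0] + half_side, -(center[0] - half_side),
                  center[1] + half_side, -(center[1] - half_side)])
    return B, b

# Two knowledge subsquares (one per class) inside the leftmost two squares of
# the bottom row; sides are 0.75 times the 0.5-wide checkerboard squares.
half = 0.75 * 0.5 / 2.0
B1, b1 = box_as_inequalities(np.array([-0.75, -0.75]), half)  # class of its center
B2, b2 = box_as_inequalities(np.array([-0.25, -0.75]), half)  # the other class
```

These (B, b) pairs are exactly what would enter the knowledge blocks of (23) sketched earlier, with the set whose center carries the label −1 entering through the C^j constraints.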

It is interesting to note that, by using prior knowledge from only two squares, our linear programming formulation (23) is capable of transforming a complex boundary between the union of eight squares in each class into a single nonlinear classifier equation given by (4).

5 Conclusion

We have presented a knowledge-based formulation of a nonlinear kernel SVM classifier. The classifier is obtained using a linear programming formulation with any nonlinear symmetric kernel, with no positive definiteness (Mercer) condition assumed. The formulation works equally well with or without conventional datasets. We note that, unlike the linear kernel case with prior knowledge [3], where the absence of conventional datasets was handled by deleting some constraints from a linear programming formulation, here arbitrary representative points from the knowledge sets are utilized to play the role of such datasets, that is, A and D in (23). The issues associated with sampling the knowledge sets, in situations where there are no conventional data points, constitute an interesting topic for future research. Future applications to medical problems, computer vision, microarray gene classification, and efficacy of drug treatment, all of which have prior knowledge available, are planned.

Acknowledgments

Research described in this UW Data Mining Institute Report 03-02, March 2003, was supported by NSF Grant CCR-0138308, by NLM Grant 1 R01 LM07050-01, and by Microsoft.

References

1. P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Machine Learning Proceedings of the Fifteenth International Conference (ICML '98), pages 82-90, San Francisco, California, 1998. Morgan Kaufmann. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.
2. V. Cherkassky and F. Mulier. Learning from Data - Concepts, Theory and Methods. John Wiley & Sons, New York, 1998.
3. G. Fung, O. L. Mangasarian, and J. Shavlik. Knowledge-based support vector machine classifiers. Technical Report 01-09, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, November 2001. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-09.ps, NIPS 2002 Proceedings, to appear.
4. F. Girosi and N. Chan. Prior knowledge and the creation of virtual examples for RBF networks. In Neural Networks for Signal Processing, Proceedings of the 1995 IEEE-SP Workshop, pages 201-210, New York, 1995. IEEE Signal Processing Society.

5. T. K. Ho and E. M. Kleinberg. Building projectable classifiers of arbitrary complexity. In Proceedings of the 13th International Conference on Pattern Recognition, pages 880-885, Vienna, Austria, 1996. http://cm.bell-labs.com/who/tkh/pubs.html. Checker dataset at: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/checker.
6. T. K. Ho and E. M. Kleinberg. Checkerboard dataset, 1996. http://www.cs.wisc.edu/math-prog/mpml.html.
7. Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001, CD-ROM Proceedings. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.
8. Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5-22, 2001. Data Mining Institute, University of Wisconsin, Technical Report 99-03. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.
9. O. L. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, PA, 1994.
10. O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 135-146, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.
11. B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 640-646, Cambridge, MA, 1998. MIT Press.
12. A. Smola and B. Schölkopf. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
13. G. G. Towell and J. W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70:119-165, 1994.
14. G. G. Towell, J. W. Shavlik, and M. Noordewier. Refinement of approximate domain theories by knowledge-based artificial neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90), pages 861-866, 1990.
15. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, second edition, 2000.