To appear in Optimization Methods and Software

ROBUST LINEAR PROGRAMMING DISCRIMINATION OF TWO LINEARLY INSEPARABLE SETS

KRISTIN P. BENNETT and O. L. MANGASARIAN
Computer Sciences Department, University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706

A single linear programming formulation is proposed which generates a plane that minimizes an average sum of misclassified points belonging to two disjoint point sets in n-dimensional real space. When the convex hulls of the two sets are also disjoint, the plane completely separates the two sets. When the convex hulls intersect, our linear program, unlike all previously proposed linear programs, is guaranteed to generate some error-minimizing plane, without the imposition of extraneous normalization constraints that inevitably fail to handle certain cases. The effectiveness of the proposed linear program has been demonstrated by successfully testing it on a number of databases. In addition, it has been used in conjunction with the multisurface method of piecewise-linear separation to train a feed-forward neural network with a single hidden layer.

KEY WORDS: Linear Programming, Pattern Recognition, Neural Networks

1 INTRODUCTION

We consider the two point sets A and B in the n-dimensional real space R^n, represented by the m × n matrix A and the k × n matrix B respectively. Our principal objective here is to formulate a single linear program with the following properties:

(i) If the convex hulls of A and B are disjoint, a strictly separating plane is obtained.
(ii) If the convex hulls of A and B intersect, a plane is obtained that minimizes some measure of misclassified points, for all possible cases.
(iii) No extraneous constraints are imposed on the linear program that rule out any specific case from consideration.

Most linear programming formulations [6,12,5,4] have property (i); however, none to our knowledge has properties (ii) and (iii). For example, the linear program of Mangasarian [6] fails to satisfy property (ii) for all linearly inseparable cases, while Smith's linear program [12] fails to satisfy (ii) when uniform weights are used in its objective function as originally proposed. The linear programs [5,4] fail to satisfy both (ii) and (iii). Our linear programming formulation, on the other hand, has all three properties (i), (ii) and (iii).

It is interesting to note that our proposed linear program (2.11) will always generate some error-minimizing plane even in the usually troublesome case when the means of the two sets are identical. For this case, among the possible solutions to our linear program is the null solution w = 0. However, this null solution is never unique for our linear program and thus a useful alternative solution is always available. For example, such an alternative, the 45° line, is obtained computationally by our linear program for the classical counterexample of linear inseparability: the Exclusive-Or example. (See Example 2.7 below.)

We outline our results now. In Section 2 we state our linear program (2.11) and establish that it possesses properties (i)-(iii) above in Theorems 2.5 and 2.6. On the other hand, in Example 2.8 we show that a point (w̄ = 0, γ̄) uniquely solves Smith's linear program ((2.10b) with λ = μ = 1/(m+k)) and hence property (ii) is violated. Similarly, in Remark 2.9 we give an example which violates property (ii) for Grinold's linear program [5] (2.20) and give conditions under which this is always true. In Section 3 we report on some computational results using our proposed linear program on the Wisconsin Breast Cancer Database and the Cleveland Heart Disease Database. See also Bennett-Mangasarian [1] for other computational results using linear programming on these databases.

A word about our notation now. For a vector x in the n-dimensional real space R^n, x_+ will denote the vector in R^n with components (x_+)_i := max{x_i, 0}, i = 1,...,n. The notation A ∈ R^{m×n} will signify a real m × n matrix. For such a matrix, A' will denote the transpose while A_i will denote the ith row. The 1-norm of x, Σ_{i=1}^n |x_i|, will be denoted by ||x||_1, while the ∞-norm of x, max_{1≤i≤n} |x_i|, will be denoted by ||x||_∞. A vector of ones in a real space of arbitrary dimension will be denoted by e.

2 A ROBUST LINEAR PROGRAMMING SEPARATION

Our linear program is based on the following error-minimizing optimization problem

(2.1)    min_{w,γ}  (1/m) ||(-Aw + eγ + e)_+||_1 + (1/k) ||(Bw - eγ + e)_+||_1

where A ∈ R^{m×n} represents the m points of the set A, B ∈ R^{k×n} represents the k points of the set B, w is the n-dimensional "weight" vector representing the normal to the optimal "separating" plane, and the real number γ is a threshold that gives the location of the separating plane wx = γ. The choice of the weights 1/m and 1/k in (2.1) is critical (as we shall demonstrate below in Theorems 2.5 and 2.6) in that it sets our formulation apart from Smith's linear program [12], where equal weights were proposed, and from other linear programming formulations [6,5,4]. Our choice, we believe, is a "natural" one in that the useless null solution w = 0 is not encountered computationally for linearly inseparable sets.
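To make the role of the weights 1/m and 1/k concrete, the objective of (2.1) can be evaluated directly for any candidate plane. The following is a minimal numpy sketch for illustration only; the function name average_violations and the use of numpy are our assumptions, not part of the paper.

```python
import numpy as np

def average_violations(w, gamma, A, B):
    """Objective (2.1): weighted average of the 1-norm violations of the plane wx = gamma."""
    m, k = A.shape[0], B.shape[0]
    viol_A = np.maximum(-A @ w + gamma + 1.0, 0.0)   # points of A falling below wx = gamma + 1
    viol_B = np.maximum(B @ w - gamma + 1.0, 0.0)    # points of B falling above wx = gamma - 1
    return viol_A.sum() / m + viol_B.sum() / k
```

A zero value of this quantity corresponds exactly to the linear separability characterization of Lemma 2.2 below.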

This is theoretically justified (Theorem 2.5 below) because w = 0 cannot be a solution unless the following equality between the arithmetic means of A and B holds:

(2.2)    e'A/m = e'B/k

However, in this case it is guaranteed that a nonzero optimal w exists in addition to w = 0 (Theorem 2.6 below).

We begin our analysis by justifying the use of the optimization problem (2.1), which minimizes the average of the misclassified points of A and B by the separating plane wx = γ. We define now linear separability for concreteness.

Definition 2.1. (Linear Separability) The point sets A and B, represented by the matrices A ∈ R^{m×n} and B ∈ R^{k×n} respectively, are linearly separable if and only if

(2.3)    min_{1≤i≤m} A_i v > max_{1≤i≤k} B_i v    for some v ∈ R^n

or equivalently

(2.4)    Aw ≥ eγ + e,   eγ - e ≥ Bw    for some w ∈ R^n, γ ∈ R

That (2.4) implies (2.3) is evident. To see the converse just note the relations

(2.5)    w := 2v/δ,   δ := min_{1≤i≤m} A_i v - max_{1≤i≤k} B_i v > 0,   γ := (min_{1≤i≤m} A_i v + max_{1≤i≤k} B_i v)/δ

Note also that when the sets A and B are linearly separable, as defined by (2.4), the plane

(2.6)    {x | wx = γ}

is a strict separating plane with

(2.7)    Aw > eγ   and   eγ > Bw

With the above definitions the following lemma becomes evident.

Lemma 2.2. The sets A and B represented by A ∈ R^{m×n} and B ∈ R^{k×n} respectively are linearly separable if and only if the minimum value of (2.1) is zero, in which case (w = 0, γ) cannot be optimal.

Proof. Note that the minimum of (2.1) is zero if and only if -Aw + eγ + e ≤ 0 and Bw - eγ + e ≤ 0, which is equivalent to the linear separability definition (2.4). To see that (w = 0, γ) cannot be optimal for (2.1), note that if we set w = 0 in (2.1) we get

(2.8)    min_γ  (γ + 1)_+ + (1 - γ)_+  =  2  >  0

which contradicts the requirement that the minimum of (2.1) be zero for linearly separable A and B.
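As a small illustration of Definition 2.1, the strict separation condition (2.7) for a given candidate plane wx = γ can be checked directly. This is an illustrative numpy sketch under our own naming, not code from the paper.

```python
import numpy as np

def strictly_separates(w, gamma, A, B):
    """Condition (2.7): Aw > e*gamma and e*gamma > Bw, i.e. the plane wx = gamma places
    all points of A strictly on one side and all points of B strictly on the other."""
    return bool(np.all(A @ w > gamma) and np.all(B @ w < gamma))
```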

Figure 1: An Optimal Separator wx = γ for Linearly Inseparable Sets: A (o) and B (+)

The import of Lemma 2.2 is that the optimization problem (2.1), which is equivalent to the linear program (2.11) below, will always generate a separating plane wx = γ for linearly separable sets A and B. For linearly inseparable sets A and B the optimization problem (2.1) will generate an optimal separating plane wx = γ which minimizes the average violations

(2.1a)    (1/m) Σ_{i=1}^m (-A_i w + γ + 1)_+  +  (1/k) Σ_{i=1}^k (B_i w - γ + 1)_+

of points of A which lie on the wrong side of the plane wx = γ + 1, that is in {x | wx < γ + 1}, and of points of B which lie on the wrong side of the plane wx = γ - 1, that is in {x | wx > γ - 1}. See Fig. 1.

Note also that the location of the plane wx = γ obtained by minimizing the average violations (2.1a) can be further optimized by holding w fixed at the optimal value and solving the one-dimensional optimization problem in γ

(2.1c)    min_{min_i A_i w ≤ γ ≤ max_j B_j w}  (1/m) Σ_{i=1}^m (-A_i w + γ + 1)_+  +  (1/k) Σ_{i=1}^k (B_i w - γ + 1)_+

This "secondary" optimization is not necessary in general, but for some problems it does improve the location of the optimal separator for a fixed orientation of the planes. The objective of the one-dimensional problem (2.1c) is a piecewise-linear convex function which can be easily minimized by evaluating the function at the breakpoints γ = A_1 w, ..., A_m w, B_1 w, ..., B_k w.
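The breakpoint evaluation just described is easy to carry out. The sketch below is ours (numpy assumed, with our own function name), not the paper's code.

```python
import numpy as np

def refine_gamma(w, A, B):
    """Secondary optimization (2.1c): for fixed w, relocate gamma by evaluating the
    piecewise-linear convex objective at the breakpoints gamma = A_i w, B_j w."""
    m, k = A.shape[0], B.shape[0]
    def obj(g):
        return (np.maximum(-A @ w + g + 1.0, 0.0).sum() / m
                + np.maximum(B @ w - g + 1.0, 0.0).sum() / k)
    breakpoints = np.concatenate([A @ w, B @ w])
    return min(breakpoints, key=obj)
```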

In order to set up the equivalent linear programming formulation to (2.1) we first state a simple lemma that relates a norm minimization problem such as (2.1) to a constrained optimization problem devoid of norms of plus-functions.

Lemma 2.3. Let g: R^n → R^m, h: R^n → R^k and let S be a subset of R^n. The problems

(2.9a)    min_{x∈S}  ||g(x)_+||_1 + ||h(x)_+||_1

(2.9b)    min_{x∈S}  { e'y + e'z  |  y ≥ g(x),  y ≥ 0,  z ≥ h(x),  z ≥ 0 }

have identical solution sets.

Proof. The equivalence follows by noting that for the minimization problem (2.9b), the optimal y, z and x must be related through the equalities y = g(x)_+, z = h(x)_+.

By using this lemma we can state an equivalent linear programming formulation to (2.1) as follows.

Proposition 2.4. For λ > 0, μ > 0, the error-minimizing problem

(2.10a)    min_{w,γ}  λ ||(-Aw + eγ + e)_+||_1 + μ ||(Bw - eγ + e)_+||_1

is equivalent to the linear program

(2.10b)    min_{w,γ,y,z}  { λ e'y + μ e'z  |  Aw - eγ + y ≥ e,  -Bw + eγ + z ≥ e,  y ≥ 0,  z ≥ 0 }

The linear program (2.10b), originally proposed by Smith [12] with equal weights λ = μ = 1/(m+k), does not possess all the properties found in our linear program with λ = 1/m and μ = 1/k:

(2.11)    min_{w,γ,y,z}  { e'y/m + e'z/k  |  Aw - eγ + y ≥ e,  -Bw + eγ + z ≥ e,  y ≥ 0,  z ≥ 0 }

The principal property that (2.11) has over other linear programs, including Smith's, is that for the linearly inseparable case it will always generate a nontrivial w without an extraneous constraint. To our knowledge no other linear programming formulation has this property for linearly inseparable sets.
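For concreteness, the linear program (2.11) can be assembled and solved with any LP solver. The following sketch stacks the variables as x = (w, γ, y, z) and uses scipy.optimize.linprog; the solver choice, the variable stacking and the function name robust_lp are our illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def robust_lp(A, B):
    """Solve the robust LP (2.11); returns (w, gamma) of the separating plane wx = gamma."""
    m, n = A.shape
    k = B.shape[0]
    # objective: e'y/m + e'z/k over x = (w, gamma, y, z)
    c = np.concatenate([np.zeros(n + 1), np.ones(m) / m, np.ones(k) / k])
    # Aw - e*gamma + y >= e   rewritten as  -Aw + e*gamma - y <= -e
    G1 = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    # -Bw + e*gamma + z >= e  rewritten as   Bw - e*gamma - z <= -e
    G2 = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)   # w, gamma free; y, z >= 0
    res = linprog(c, A_ub=np.vstack([G1, G2]), b_ub=-np.ones(m + k),
                  bounds=bounds, method="highs")
    assert res.success
    return res.x[:n], res.x[n]
```

By Lemma 2.2, an optimal objective value of zero signals that the two sets are in fact linearly separable.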

We establish this property by first considering the linear program (2.10b) for arbitrary positive weights λ and μ and showing under what conditions w = 0 constitutes a solution of the problem.

Theorem 2.5. (Occurrence of the null solution w in the linear program (2.10b)) Let μk ≥ λm. The linear program (2.10b) has a solution (w̄ = 0, γ̄, ȳ, z̄) if and only if

(2.12)    e'A/m = v'B,   v ≥ 0,   e'v = 1,   v ≤ (μ/(λm)) e,

that is, if the arithmetic mean of the points in A equals a convex combination of points of B. When μk = λm, (2.12) degenerates to

(2.12a)    e'A/m = e'B/k,

that is, the arithmetic mean of the points in A equals the arithmetic mean of the points in B.

Proof. We note first that μk ≥ λm does not result in any loss of generality, because the roles of the sets A and B can be switched to obtain this inequality. Consider now the dual of the linear program (2.10b):

(2.13)    max_{u,v}  { e'u + e'v  |  A'u - B'v = 0,  -e'u + e'v = 0,  0 ≤ u ≤ λe,  0 ≤ v ≤ μe }

The point (w̄ = 0, γ̄, ȳ, z̄) is optimal for the primal problem (2.10b) if and only if

(2.14a)    2λm  =  min{2λm, 2μk}  =  min_{γ,y,z}  { λe'y + μe'z  |  -eγ + y ≥ e,  eγ + z ≥ e,  (y, z) ≥ 0 }

(2.14b)         =  max_{u,v}  { e'u + e'v  |  A'u - B'v = 0,  -e'u + e'v = 0,  0 ≤ u ≤ λe,  0 ≤ v ≤ μe }

Since e'u = e'v and e'u + e'v = 2λm, it follows that e'u = e'v = λm. Since 0 ≤ u ≤ λe, if any u_i < λ then e'u < λm, contradicting e'u = λm. Hence u = λe and e'v = e'u = λm. By normalizing u and v, that is dividing by λm, we obtain (2.12). When μk = λm, then from (2.12) we have that 0 ≤ v ≤ e/k. Since e'v = 1, it follows that v = e/k and (2.12a) follows from (2.12).

This theorem gives a theoretical explanation of some observed computational experience, namely that Smith's linear program (2.10b) with λ = μ = 1/(m+k) sometimes ended with the useless null w for real-world linearly inseparable problems, whereas our linear program (2.11) never did. The reason for that is the rarity of the satisfaction of (2.12a) by real problems, in contrast to the possibly frequent satisfaction of (2.12).
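The practical import of (2.12a) is easy to probe on a given data set: the arithmetic means of A and B must coincide exactly for w = 0 to solve (2.11), which rarely happens with real data. A trivial numpy check, ours and for illustration only:

```python
import numpy as np

def means_coincide(A, B, tol=1e-12):
    """Condition (2.12a): the arithmetic means of the two point sets are equal."""
    return bool(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0), np.inf) <= tol)
```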

We now proceed to our next results, which show that when the null vector w = 0 constitutes a solution to the linear program (2.10b), except for our proposed choice of λ = 1/m and μ = 1/k, such a w = 0 can be unique and nothing can be done to alter it. (See Example 2.8 below.) However, for our linear program (2.11), even when the null w occurs in the rare case of (2.12a), e.g. in the contrived but classical Exclusive-Or example, there always exists an alternate non-null optimal w. (See Example 2.7 below.) These results are contained in the following theorem, examples and remarks.

Theorem 2.6. (Nonuniqueness of the null w solution to the linear program (2.11)) The solution (w̄ = 0, γ̄, ȳ, z̄) to (2.11) is not unique in w̄.

Proof. Note from the first equality of (2.14a) with λ = 1/m and μ = 1/k that when (w̄ = 0, γ̄, ȳ, z̄) is a solution to (2.11), then γ̄ can be any point in [-1, 1]. In particular, take γ̄ = 0. Then for this choice of w̄ = 0, γ̄ = 0, the corresponding optimal ȳ, z̄ for (2.11) are ȳ = e, z̄ = e and the active constraints are the first two constraints of (2.11). Hence (w̄, γ̄, ȳ, z̄) is unique in w if and only if the following system of linear inequalities has no solution (w, γ, y, z):

(2.15a)    e'y/m + e'z/k ≤ e'ȳ/m + e'z̄/k,   Aw - eγ + y ≥ Aw̄ - eγ̄ + ȳ = e,   -Bw + eγ + z ≥ -Bw̄ + eγ̄ + z̄ = e,   w ≠ w̄

This is equivalent to the following system of linear inequalities having no solution (w, γ, y, z) for each h in R^n:

(2.15b)    e'(ȳ - y)/m + e'(z̄ - z)/k ≥ 0,   A(w - w̄) - e(γ - γ̄) + (y - ȳ) ≥ 0,   -B(w - w̄) + e(γ - γ̄) + (z - z̄) ≥ 0,   h(w - w̄) > 0

By Motzkin's theorem of the alternative [8], (2.15b) has no solution for a given h in R^n if and only if the following system of linear relations does have a solution (σ, u, v) for that h in R^n:

(2.16)    A'u - B'v = h,   -e'u + e'v = 0,   -σe/m + u = 0,   -σe/k + v = 0,   σ, u, v ≥ 0

Obviously it is possible to choose h in R^n such that (2.16) has no solution, since there are h in R^n that cannot be written as

(2.17)    h = σ(A'e/m - B'e/k),   σ ≥ 0.

Hence (2.15b) has a solution for some h in R^n. Consequently (2.15a) has a solution and w̄ = 0 is not unique.

We now apply this theorem to the classical Exclusive-Or example, for which condition (2.12a) is satisfied and hence (w̄ = 0, γ̄, ȳ, z̄) is a solution to (2.11) which, however, is not unique in w̄ = 0.

Example 2.7. (Exclusive-Or)

A = [0 0; 1 1],    B = [1 0; 0 1]

that is, A consists of the points (0,0) and (1,1) while B consists of the points (1,0) and (0,1). For this example e'A = e'B and (w̄ = 0, γ̄ = 0, ȳ = e, z̄ = e) is a solution to the linear program (2.11), which can also be written in the equivalent (2.1) formulation as

(2.18)    min_{w,γ}  (1/2) ||( γ + 1,  -w_1 - w_2 + γ + 1 )_+||_1  +  (1/2) ||( w_1 - γ + 1,  w_2 - γ + 1 )_+||_1

However, the point w̄ = (1, 1)', γ̄ = 1 is also optimal, because it gives the same minimum value of 2. The 45° direction in the w-space associated with this solution is useful in the multisurface method of pattern separation [7], since it can be used to generate the first part of a piecewise-linear separator. In practice the linear programming package returned another optimal point generating a second 45° direction that can also be used for piecewise-linear separation.
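The two optimal points mentioned in Example 2.7 are easy to verify numerically by plugging them into the objective (2.18). A self-contained numpy check of ours, for illustration only:

```python
import numpy as np

A = np.array([[0., 0.], [1., 1.]])     # the set A (o)
B = np.array([[1., 0.], [0., 1.]])     # the set B (+)

def f(w, gamma):
    # objective (2.1) specialized to the Exclusive-Or data, i.e. problem (2.18)
    return (np.maximum(-A @ w + gamma + 1, 0).sum() / 2
            + np.maximum(B @ w - gamma + 1, 0).sum() / 2)

print(f(np.zeros(2), 0.0))             # 2.0 for the null solution w = 0, gamma = 0
print(f(np.array([1., 1.]), 1.0))      # 2.0 for the 45-degree alternative w = (1,1)', gamma = 1
```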

We turn now to the case when m ≠ k. In particular, consider Smith's case of λ = μ = 1/(m+k). A similar analysis to that of Theorem 2.6 does not give guaranteed nonuniqueness of the solution (w̄ = 0, γ̄, ȳ, z̄) in w̄ = 0 for the linear program (2.10b) with λ = μ = 1/(m+k). In fact, to the contrary, the analysis shows that w̄ = 0 is indeed unique under certain conditions, which are satisfied by the following counterexample from Mangasarian et al [9] to Smith's claim that his linear program (2.10b) with λ = μ = 1/(m+k) always generates a nonzero w. In reality w̄ = 0 is unique for this example.

Example 2.8. (Unique w̄ = 0 for Smith's LP (2.10b) with λ = μ = 1/(m+k)) Consider the one-dimensional counterexample of Mangasarian et al [9], with m = 2 points in A and k = 3 points in B, so that λ = μ = 1/5. For this problem the minimum of the norm-minimization problem equivalent to Smith's LP (2.10b),

(2.19)    min_{w,γ}  f(w,γ) := (1/5) ||(-Aw + eγ + e)_+||_1 + (1/5) ||(Bw - eγ + e)_+||_1,

is achieved at w̄ = 0 and an appropriate γ̄. The uniqueness of this solution in w can be established by considering the subdifferential ∂f(0, γ̄) (see Polyak [10] and Fletcher [3], p. 363) and verifying that 0 lies in its interior, whence (0, γ̄) is the unique solution of (2.19) by a lemma of Polyak [10]. (The uniqueness of (0, γ̄) can also be shown by considering the linear program (2.10b) directly.)

Remark 2.9. Grinold [5] proposed the following linear program

(2.20)    max_{w,γ,δ}  δ   subject to   Aw - eγ - eδ ≥ 0,   -Bw + eγ - eδ ≥ 0,   (e'A - e'B)w + (k - m)γ = k + m

In Mangasarian et al [9] a one-dimensional example with two points in A and three points in B was given to show that the solution (w̄ = 0, γ̄ = 5, δ̄ = -5) of (2.20) is unique in w̄. In fact, it can be shown that (w̄ = 0, γ̄, δ̄) is always a solution of (2.20) whenever there exists a u such that

(2.21)    u'A + (e'A - e'B)/(k - m) = 0,   e'u = 1,   u ≥ 0,   k > m

Furthermore, w̄ = 0 is unique if for each h in R^n

(2.22)    ( A' + (e'A - e'B)' e'/(k - m) ) u = h   has a solution u

3 COMPUTATIONAL COMPARISONS

In this section we give computational comparisons on two real-world databases, the Wisconsin Breast Cancer Database [14,13] and the Cleveland Heart Disease Database [2], using our proposed linear program (2.11), Smith's linear program (2.10b) with λ = μ = 1/(m+k), and the following linear programming formulation [9] for multisurface discrimination:

(3.1)    max_{1≤i≤n}  max_{w,α,β}  { α - β  |  Aw ≥ eα,  Bw ≤ eβ,  -e ≤ w ≤ e,  w_i = 1 }

Note that (3.1) can be solved by solving n linear programs. The corresponding linear separation obtained from (3.1) is

(3.2)    wx = (α + β)/2

where (w̄, ᾱ, β̄) is a solution of (3.1).
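Formulation (3.1) is itself straightforward to set up as n linear programs, one for each normalization w_i = 1. The sketch below uses scipy.optimize.linprog; the function name msm_plane and the solver are our assumptions for illustration, not the code used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def msm_plane(A, B):
    """Multisurface formulation (3.1): solve n LPs, one per fixed w_i = 1,
    and return the separator (3.2): wx = (alpha + beta)/2."""
    m, n = A.shape
    k = B.shape[0]
    c = np.concatenate([np.zeros(n), [-1.0, 1.0]])                        # minimize -(alpha - beta)
    G = np.vstack([np.hstack([-A, np.ones((m, 1)), np.zeros((m, 1))]),    # e*alpha <= Aw
                   np.hstack([B, np.zeros((k, 1)), -np.ones((k, 1))])])   # Bw <= e*beta
    h = np.zeros(m + k)
    best = None
    for i in range(n):
        bounds = [(-1.0, 1.0)] * n + [(None, None), (None, None)]
        bounds[i] = (1.0, 1.0)                                            # fix w_i = 1
        res = linprog(c, A_ub=G, b_ub=h, bounds=bounds, method="highs")
        if res.success and (best is None or res.fun < best.fun):
            best = res
    w, alpha, beta = best.x[:n], best.x[n], best.x[n + 1]
    return w, (alpha + beta) / 2.0
```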

Problem (3.1) is equivalent [9] to

(3.3)    max_{w,α,β}  { α - β  |  Aw ≥ eα,  Bw ≤ eβ,  ||w||_∞ = 1 }

which in turn is easily seen to be the following problem:

(3.4)    max_{||w||_∞ = 1}  ( min_{1≤i≤m} A_i w  -  max_{1≤i≤k} B_i w )

Fig. 2 summarizes the results obtained for the two linearly inseparable databases using the three linear programs mentioned above. The Wisconsin Breast Cancer Database consists of 566 points, of which 354 are benign and 212 are malignant, all in a 9-dimensional real space. The Cleveland Heart Disease Database consists of 297 points in a 13-dimensional real space, of which 137 are negative and 160 are positive. Our testing methodology consisted of dividing each set randomly into a training set consisting of 67% of the data and a testing set consisting of the remaining 33%. Each linear programming formulation was run on the training set and the resulting separator tested on the testing set. This was repeated ten times and the average results of the ten runs are depicted in Fig. 2. No "secondary" optimization using (2.1c) was performed for any method. For each database our linear program (2.11), referred to as MSM1 (multisurface method 1-norm), outperformed both Smith's linear program (2.10b) with λ = μ = 1/(m+k) and the linear programming method of (3.1), referred to as MSM. The average run times for MSM1 and Smith are very close, 3.84 and 3.89 seconds respectively on a DECstation 5000 for the Wisconsin Breast Cancer Database, while MSM, which requires the solution of n linear programs, was considerably slower on both databases. Note that the percent error of MSM1 on the testing set was better than that of Smith and considerably better than that of MSM on both databases.
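The testing methodology just described amounts to a simple repeated random split. The following is a sketch of ours, not the authors' code; train_plane stands for any discriminator, such as the robust_lp or msm_plane sketches above, that returns a plane (w, γ).

```python
import numpy as np

def average_test_error(A, B, train_plane, trials=10, train_frac=0.67, seed=0):
    """Ten random 67%/33% splits: train a plane on the first part of each set and
    report the average percent error on the held-out points."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        pa, pb = rng.permutation(len(A)), rng.permutation(len(B))
        ca, cb = int(train_frac * len(A)), int(train_frac * len(B))
        w, gamma = train_plane(A[pa[:ca]], B[pb[:cb]])
        test_A, test_B = A[pa[ca:]], B[pb[cb:]]
        wrong = int((test_A @ w <= gamma).sum() + (test_B @ w >= gamma).sum())
        errors.append(100.0 * wrong / (len(test_A) + len(test_B)))
    return float(np.mean(errors))
```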

4 CONCLUSIONS

We have presented a robust linear program which always generates a linear surface as an "optimal" separator for two linearly inseparable sets. The "optimality" of the separator consists in minimizing a weighted average sum of the violations of the points lying on the wrong side of the separator. By using an appropriately weighted sum, we have overcome the problem of the null solution which has plagued previous linear programming approaches. These approaches either left this difficulty unresolved or imposed an extraneous linear constraint which never resolved the problem completely. The fact that computational results on real-world problems give an edge to our linear program over Smith's, and a substantial edge over another linear programming approach, makes it, in our opinion, a suitable linear program for the linearly inseparable case.

Figure 2: Comparison of Three Linear Programming Discriminators for Linearly Inseparable Sets (percent errors of MSM1, Smith and MSM on the training and testing sets of the Wisconsin Breast Cancer Database and the Cleveland Heart Disease Database)

ACKNOWLEDGEMENTS

This material is based on research supported by Air Force Office of Scientific Research Grant AFOSR-89-0410, National Science Foundation Grant CCR-9101801, and Air Force Laboratory Graduate Fellowship Program Grant SSAN.

REFERENCES

1. K. P. Bennett and O. L. Mangasarian, Neural Network Training Via Linear Programming, in: P. M. Pardalos (Ed.), Advances in Optimization and Parallel Computing, North-Holland, Amsterdam, 1992.
2. R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee and V. Froelicher, International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease, American Journal of Cardiology 64, 1989.
3. R. Fletcher, Practical Methods of Optimization, Second Edition, Wiley, New York, 1987.
4. F. Glover, Improved Linear Programming Models for Discriminant Analysis, Decision Sciences 21(4), 1990.
5. R. C. Grinold, Mathematical Programming Methods of Pattern Classification, Management Science 19, 1972.
6. O. L. Mangasarian, Linear and Nonlinear Separation of Patterns by Linear Programming, Operations Research 13, 1965.
7. O. L. Mangasarian, Multisurface Method of Pattern Separation, IEEE Transactions on Information Theory IT-14(6), 1968.
8. O. L. Mangasarian, Nonlinear Programming, McGraw-Hill, New York, 1969.
9. O. L. Mangasarian, R. Setiono and W. H. Wolberg, Pattern Recognition via Linear Programming: Theory and Application to Medical Diagnosis, in: Thomas F. Coleman and Yuying Li (Eds.), Large-Scale Numerical Optimization, SIAM, Philadelphia, 1990.
10. B. T. Polyak, Introduction to Optimization, Optimization Software, New York, 1987.
11. D. E. Rumelhart, G. E. Hinton and J. L. McClelland, Learning Internal Representations, in: D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, Vol. I, M.I.T. Press, Cambridge, Massachusetts, 1986.
12. F. W. Smith, Pattern Classifier Design by Linear Programming, IEEE Transactions on Computers C-17(4), 1968.
13. W. H. Wolberg and O. L. Mangasarian, Multisurface Method of Pattern Separation Applied to Breast Cytology Diagnosis, Proceedings of the National Academy of Sciences U.S.A. 87, 1990.
14. W. H. Wolberg, M. S. Tanner and W. Y. Loh, Diagnostic Schemes for Fine Needle Aspirates of Breast Masses, Analytical and Quantitative Cytology and Histology 10, 1988.
