Training algorithms for fuzzy support vector machines with noisy data

1 Training algorithms for fuzzy support vector machines with noisy data. Presented by Josh Hoak. Chun-fu Lin (1), Sheng-de Wang (1); (1) National Taiwan University. 13 April 2010

2 Prelude Problem: SVMs are particularly susceptible to outliers. Fuzzy SVMs are a method to cope with outliers in SVMs.

3 A Motivating Example

4 Recap: Linear Support Vector Machine. Training data: $(y_1, x_1), \ldots, (y_l, x_l)$ with $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$. Goal: draw a hyperplane separating the two classes; that is, find a vector $w$ and a translation $b$ such that the hyperplane $w \cdot x + b = 0$ satisfies the constraint $y_i (w \cdot x_i + b) \ge 1$ for $i = 1, \ldots, l$, where each $x_i$ is an example from the training data.
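The separating constraint on this slide can be checked directly for a candidate $(w, b)$. The sketch below is only an illustration: the helper names `dot` and `satisfies_margin` and the toy data are assumptions, not part of the slides.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_margin(w, b, X, y):
    """Return True if y_i * (w . x_i + b) >= 1 holds for every training pair."""
    return all(yi * (dot(w, xi) + b) >= 1 for xi, yi in zip(X, y))


# Toy data (illustrative only): two points per class in the plane.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]

print(satisfies_margin((1.0, 0.0), 0.0, X, y))   # margins are 2, 3, 2, 3
print(satisfies_margin((0.25, 0.0), 0.0, X, y))  # margins shrink to 0.5, 0.75, 0.5, 0.75
```

Scaling $w$ down shrinks every margin value, which is why the second candidate violates the constraint even though it separates the classes.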

5 Non-linear SVM. Non-linear mapping: transform the data with some function and then try to separate it. Let $\varphi : \mathbb{R}^n \to F$, where $F$ denotes the feature space. Then $z = \varphi(x)$. Error term: add an error term $\xi_i \ge 0$ to the constraint: $y_i (w \cdot z_i + b) \ge 1 - \xi_i$, $i = 1, \ldots, l$.

6 Non-linear SVMs continued. Equivalently, we can minimize $\frac{1}{2} w \cdot w + C \sum_{i=1}^{l} \xi_i$ subject to $y_i (w \cdot z_i + b) \ge 1 - \xi_i$.

7 Non-linear SVMs continued. Kernel trick: find a kernel function $K(\cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ such that $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) = z_i \cdot z_j$. Reformulation: the optimal hyperplane is then found as $f_H(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$. Decision function: $f_D(x) = \operatorname{sign}(f_H(x))$.
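A minimal sketch of evaluating $f_H$ and $f_D$ once the multipliers $\alpha_i$ and offset $b$ are known. The slides do not name a kernel, so the Gaussian (RBF) kernel, the `gamma` parameter, the function names, and the toy values are all assumptions made for illustration.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2); an assumed choice of K."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def f_H(x, points, alphas, labels, b, kernel=rbf_kernel):
    """f_H(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, labels, points)) + b


def f_D(x, points, alphas, labels, b, kernel=rbf_kernel):
    """sign(f_H(x)); a value of exactly 0 is mapped to +1 here by convention."""
    return 1 if f_H(x, points, alphas, labels, b, kernel) >= 0 else -1


# Hypothetical trained model: one support vector per class.
points = [(0.0, 0.0), (2.0, 0.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

print(f_D((0.1, 0.0), points, alphas, labels, b))  # closer to the +1 point
print(f_D((1.9, 0.0), points, alphas, labels, b))  # closer to the -1 point
```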

8 Training Data: $(y_1, x_1, s_1), \ldots, (y_l, x_l, s_l)$, with $\sigma \le s_i \le 1$ and $\sigma > 0$. Fuzzy membership: $s_i$ is viewed as our confidence that the corresponding point $x_i$ has class $y_i$.

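9 Optimal Hyperplane: the solution to minimizing $\frac{1}{2} w \cdot w + C \sum_{i=1}^{l} s_i \xi_i$, constrained by $y_i (w \cdot z_i + b) \ge 1 - \xi_i$. Equivalently, with $\psi(\xi_i) = s_i \xi_i$, we can write: $\frac{1}{2} w \cdot w + C \sum_{i=1}^{l} \psi(\xi_i)$.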
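To see what the membership weights do to the objective, the sketch below evaluates $\frac{1}{2} w \cdot w + C \sum_i s_i \xi_i$ for a fixed $(w, b)$, taking each $\xi_i$ as the smallest slack that satisfies the constraint. The function names and the toy data are assumptions for illustration; this is not a solver.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, z, y):
    """Smallest xi >= 0 with y * (w . z + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, z) + b))


def fsvm_objective(w, b, Z, y, s, C):
    """0.5 * w.w + C * sum_i s_i * xi_i for a fixed hyperplane (w, b)."""
    penalty = sum(si * slack(w, b, zi, yi) for zi, yi, si in zip(Z, y, s))
    return 0.5 * dot(w, w) + C * penalty


# Toy 1-D data (illustrative only): the last point is labelled -1 but sits
# on the positive side, as a mislabelled outlier would.
Z = [(2.0,), (0.5,), (-2.0,), (1.5,)]
y = [1, 1, -1, -1]
w, b, C = (1.0,), 0.0, 10.0

plain = fsvm_objective(w, b, Z, y, [1.0, 1.0, 1.0, 1.0], C)  # all s_i = 1
fuzzy = fsvm_objective(w, b, Z, y, [1.0, 1.0, 1.0, 0.1], C)  # outlier down-weighted
print(plain, fuzzy)
```

With every $s_i = 1$ the expression reduces to the ordinary soft-margin objective; lowering the outlier's membership shrinks its share of the penalty, so it pulls less on the hyperplane during training.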

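10 A more familiar formulation. Maximize: $W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ subject to: $\sum_{i=1}^{l} y_i \alpha_i = 0$ and $0 \le \alpha_i \le s_i C$, $i = 1, \ldots, l$.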
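The only place the memberships enter the dual is the upper bound $s_i C$ on each $\alpha_i$. The following sketch evaluates $W(\alpha)$ and checks feasibility for a given $\alpha$; it assumes a precomputed Gram matrix `K`, and the function names, tolerance, and toy numbers are illustrative assumptions.

```python
def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    l = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, s, C, tol=1e-9):
    """Check sum_i y_i alpha_i = 0 and 0 <= alpha_i <= s_i * C."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= si * C + tol for a, si in zip(alpha, s))
    return balanced and boxed


# Toy problem (illustrative only): two points with an identity Gram matrix.
K = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]

print(dual_objective(alpha, y, K))
print(is_feasible(alpha, y, [1.0, 1.0], C=1.0))  # ordinary SVM box: 0 <= alpha_i <= C
print(is_feasible(alpha, y, [1.0, 0.2], C=1.0))  # second point capped at 0.2 * C
```

A low membership tightens the box for that point, so a suspected outlier cannot become a heavily weighted support vector.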

11 How do we find the confidence values?

12 Fuzzy SVM: The error function. Note: we model the theoretical error $\psi(\xi_i)$ by weighting $\xi_i$ with $p_x(x_i)$, the probability that the point $x_i$ is not noise. The error function becomes $\sum_{i=1}^{l} p_x(x_i)\, \xi_i$. One model:
$$p_x(x_i) = \begin{cases} 1, & \text{if } h(x_i) > h_C, \\ \sigma, & \text{if } h(x_i) < h_T, \\ \sigma + (1 - \sigma) \left( \dfrac{h(x_i) - h_T}{h_C - h_T} \right)^{d}, & \text{otherwise.} \end{cases}$$
New definitions: $h_C$ is the confidence factor, $h_T$ is the trashy factor, and $h(x)$ is a heuristic function.
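The piecewise model on this slide translates almost line for line into code. This is a sketch under the assumption that $h_T < h_C$; the function name and the sample parameter values are illustrative.

```python
def p_x(h, h_T, h_C, sigma, d):
    """Map a heuristic value h = h(x_i) to a membership in [sigma, 1].

    Assumes h_T < h_C. Values above the confidence factor h_C get 1,
    values below the trashy factor h_T get sigma, and values in between
    are interpolated with exponent d.
    """
    if h > h_C:
        return 1.0
    if h < h_T:
        return sigma
    return sigma + (1.0 - sigma) * ((h - h_T) / (h_C - h_T)) ** d


# Illustrative parameters only.
h_T, h_C, sigma, d = 1.0, 4.0, 0.1, 2

for h in (0.0, 1.0, 2.5, 4.0, 5.0):
    print(h, p_x(h, h_T, h_C, sigma, d))
```

The function is continuous at both thresholds: the interpolated branch gives $\sigma$ at $h = h_T$ and 1 at $h = h_C$.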

13 Fuzzy SVM: The error function

14 Generating fuzzy memberships $p_x(x)$. We choose $\sigma > 0$ as a lower bound. Let's suppose that our fuzzy membership (or outlier status) is based on only one feature $t$. Then we have: $s_i = h(x_i) = f(t_i)$.

15 Generating fuzzy memberships (continued). Let $t_{\max}$ be the maximum of the $t_i$ and $t_{\min}$ the minimum; when $t_i = t_{\min}$ we want the output to be $\sigma$, and when $t_i = t_{\max}$ we want it to be 1. If we make $s_i$ a linear function of $t_i$, we get $s_i = f(t_i) = a t_i + b$, where $a$ and $b$ solve the system of equations: $\sigma = a\, t_{\min} + b$, $1 = a\, t_{\max} + b$.

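16 Generating fuzzy memberships (continued). Solving, we get: $s_i = f(t_i) = \dfrac{1 - \sigma}{t_{\max} - t_{\min}}\, t_i + \dfrac{t_{\max} \sigma - t_{\min}}{t_{\max} - t_{\min}}$. If we wish to make the function polynomial (here, quadratic), we get: $s_i = f(t_i) = (1 - \sigma) \left( \dfrac{t_i - t_{\min}}{t_{\max} - t_{\min}} \right)^{2} + \sigma$.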
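Both membership maps can be checked numerically against the boundary conditions (output $\sigma$ at $t_{\min}$, output 1 at $t_{\max}$). The sketch below assumes $t_{\min} < t_{\max}$; the function names and the sample feature values are illustrative.

```python
def linear_membership(t, t_min, t_max, sigma):
    """s = a*t + b with a, b solved from sigma = a*t_min + b and 1 = a*t_max + b."""
    a = (1.0 - sigma) / (t_max - t_min)
    b = (t_max * sigma - t_min) / (t_max - t_min)
    return a * t + b


def quadratic_membership(t, t_min, t_max, sigma):
    """s = (1 - sigma) * ((t - t_min) / (t_max - t_min))**2 + sigma."""
    return (1.0 - sigma) * ((t - t_min) / (t_max - t_min)) ** 2 + sigma


# Illustrative feature values t_i (for example, a time stamp or a distance).
ts = [2.0, 4.0, 6.0, 10.0]
t_min, t_max, sigma = min(ts), max(ts), 0.1

for t in ts:
    print(t, linear_membership(t, t_min, t_max, sigma),
          quadratic_membership(t, t_min, t_max, sigma))
```

The two maps agree at the endpoints; in between, the quadratic one stays closer to $\sigma$, so it discounts low-$t$ points more aggressively.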

17 Fuzzy SVM: The heuristic function. Strategy 1: Kernel-target alignment, defined as $A_{KT} = \dfrac{\sum_{i=1}^{l} f_K(x_i, y_i)}{l \sqrt{\sum_{i,j=1}^{l} K^2(x_i, x_j)}}$, where $f_K(x_i, y_i) = \sum_{j=1}^{l} y_i y_j K(x_i, x_j)$. Idea: use $f_K(x_i, y_i)$ as the heuristic function.
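A small sketch of the per-point score $f_K$ and the overall alignment, computed from a precomputed Gram matrix. The function names are assumptions, and the toy data (1-D points with a linear kernel, one of them deliberately mislabelled) is invented for illustration.

```python
import math


def f_K(i, y, K):
    """f_K(x_i, y_i) = sum_j y_i * y_j * K(x_i, x_j)."""
    return sum(y[i] * y[j] * K[i][j] for j in range(len(y)))


def alignment(y, K):
    """Kernel-target alignment: sum_i f_K(i) / (l * sqrt(sum_ij K_ij^2))."""
    l = len(y)
    numerator = sum(f_K(i, y, K) for i in range(l))
    frobenius = math.sqrt(sum(K[i][j] ** 2 for i in range(l) for j in range(l)))
    return numerator / (l * frobenius)


# Toy data (illustrative only): the last point sits among the +1 points
# but carries the label -1.
xs = [1.0, 1.2, -1.0, 1.1]
y = [1, 1, -1, -1]
K = [[a * b for b in xs] for a in xs]  # linear kernel on scalars

scores = [f_K(i, y, K) for i in range(len(y))]
print(scores)
print(alignment(y, K))
```

The mislabelled point gets the lowest (here negative) score, which is what makes $f_K$ usable as a heuristic $h(x_i)$ for spotting likely noise.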

18 Fuzzy SVM: The heuristic function. Strategy 2: k-nn. Find the k nearest neighbors of a data point $x_i$ and count those of the same class. Assume that a data point with fewer same-class nearest neighbors has a higher probability of being noisy. Let the heuristic function be $h(x_i) = n_i$, where $n_i$ is the number of nearest neighbors of $x_i$ that share its class.
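One reading of this heuristic, sketched below, is to take the $k$ nearest neighbors of each point by Euclidean distance and count how many share its label. The distance measure, the brute-force search, the function name, and the toy data are assumptions; the slides do not fix these details.

```python
def knn_heuristic(X, y, k):
    """Return [n_1, ..., n_l], where n_i counts the points among the k nearest
    neighbours of x_i (excluding x_i itself) that have the same label as x_i."""
    scores = []
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other point, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(1 for j in neighbours if y[j] == y[i]))
    return scores


# Toy 1-D data (illustrative only): two clusters, plus one point labelled -1
# that sits inside the +1 cluster.
X = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,), (0.15,)]
y = [1, 1, 1, -1, -1, -1, -1]

print(knn_heuristic(X, y, k=3))
```

The stray point has no same-class neighbors among its three nearest, so it receives the lowest heuristic value and would be mapped to a membership near $\sigma$.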

19 Overall Procedure
1. Use the original SVM algorithm to get the optimal kernel parameters and the regularization parameter C.
2. Fix the kernel parameters and the regularization parameter C from step 1, and then find the other parameters of the FSVM:
2.1 Define the heuristic function.
2.2 Use exhaustive search to find $h_T$, $h_C$, $d$, and $\sigma$.
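Step 2.2 can be sketched as a plain grid search. Everything here is an assumption made for illustration: the grids, the function name, and in particular the `evaluate` callback, which stands in for whatever validation error an FSVM trained with the given parameters would produce.

```python
from itertools import product


def exhaustive_search(evaluate, h_T_grid, h_C_grid, d_grid, sigma_grid):
    """Try every (h_T, h_C, d, sigma) combination and keep the lowest error.

    `evaluate` is a caller-supplied function returning a validation error;
    combinations with h_T >= h_C are skipped because the interpolation
    between the two thresholds would be undefined.
    """
    best_params, best_error = None, float("inf")
    for h_T, h_C, d, sigma in product(h_T_grid, h_C_grid, d_grid, sigma_grid):
        if h_T >= h_C:
            continue
        error = evaluate(h_T, h_C, d, sigma)
        if error < best_error:
            best_params, best_error = (h_T, h_C, d, sigma), error
    return best_params, best_error


# Stand-in error surface (NOT a real FSVM evaluation), minimal at (1, 3, 2, 0.1).
def toy_error(h_T, h_C, d, sigma):
    return abs(h_T - 1) + abs(h_C - 3) + abs(d - 2) + abs(sigma - 0.1)


best = exhaustive_search(toy_error, [0, 1, 2], [2, 3, 4], [1, 2], [0.01, 0.1])
print(best)
```

The cost grows with the product of the grid sizes, and each evaluation means training at least one model, which is worth keeping in mind when choosing grid resolution.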

20 Results
Table 1: Error rates for SVMs and FSVMs using KT and k-nn
TR SVMs KT k-nn
Banana ± 0.7 *10.4 ± ± 0.6
B. Cancer ± ± 4.4 *25.2 ± 4.1
Diabetes ± 1.7 *23.3 ± ± 1.7
German ± 2.1 *23.3 ± ± 2.1
Heart ± 3.3 *14.2 ± ± 2.1
Image ± 0.6 *2.9 ±
Ringnorm 0.0 *1.7 ±
F. Solar 32.6 *32.4 ± ± ± 1.8
Splice 0.0 *10.9 ±
Thyroid ± 2.2 *4.±
Titanic ± 1.0 *22.3±0.9 *22.3 ± 1.1
Twonorm ± 0.2 *2.4 ± ± 0.2
Waveform 3.5 *9.9 ± ±
data sets from the UCI, DELVE and STATLOG

21 Ensemble Method. Use the OAA (One Against All) method for multi-class classification. Given $M$ classes, we construct $M$ binary SVM classifiers, each separating one class from the rest. For a given example, the class whose classifier outputs the highest value is chosen. Fuzzy membership function: when training, $\mathrm{Fuzz}(x) = 1$ if the output of the ensemble on $x$ is $1$, and $\mathrm{Fuzz}(x) = h$ if the output of the ensemble on $x$ is $-1$.
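The two rules on this slide are short enough to write out. In this sketch the function names and the decision values are made up, and reading the second case of Fuzz as an ensemble output of $-1$ is an assumption based on the $\{-1, 1\}$ labels used earlier in the deck.

```python
def oaa_predict(decision_values):
    """Pick the class whose one-against-all classifier outputs the highest value.

    `decision_values` maps each class name to that classifier's real-valued output.
    """
    return max(decision_values, key=decision_values.get)


def fuzz(ensemble_output, h):
    """Membership used in training: 1 for an ensemble output of +1, h for -1."""
    if ensemble_output == 1:
        return 1.0
    if ensemble_output == -1:
        return h
    raise ValueError("ensemble output must be +1 or -1")


# Made-up decision values for one document; class names follow the deck's
# confusion-matrix slide.
values = {"Earn": 0.7, "Acq": -0.2, "Money-fx": -1.1, "Grain": -0.4}

print(oaa_predict(values))
print(fuzz(1, h=0.5), fuzz(-1, h=0.5))
```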

22 Confusion Matrix. Actual class against predicted class, each over Earn, Acq, Money-fx, Grain, and Total. Note: Overall Accuracy = Data: Reuters Documents

23 Results. Macro-average performance of OAA-SVM and OAA-FSVM, 4-fold. Measures: Precision, Recall, F Measure. Classifiers: OAA-FSVM(1, 0.5), OAA-FSVM(1, 0.6), OAA-SVM
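For reference, macro-averaged precision, recall, and F measure can be computed from a confusion matrix as sketched below. The matrix here is invented (the deck's own numbers are not reproduced), the row/column orientation is an assumption, and averaging the per-class F values is one common convention among several.

```python
def macro_metrics(confusion):
    """Macro-averaged (precision, recall, F) from a square confusion matrix.

    Assumes confusion[i][j] counts items of actual class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f_scores = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column total
        actual = sum(confusion[c])                          # row total
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f_scores.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Invented two-class example, not data from the slides.
confusion = [[8, 2],
             [1, 9]]

precision, recall, f_measure = macro_metrics(confusion)
print(precision, recall, f_measure)
```

Macro-averaging weights every class equally, so a small class such as Grain counts as much as a large one such as Earn; that can make it move differently from overall accuracy.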

24 Thoughts Statistically significant results? Diminishing returns? Cost? How much do outliers affect the model?
