Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr
Contents
- Introduction
- Three learning criteria: LSE, TER, AUC
- Methods based on the three learning criteria
  - LSE: RM, ELM
  - TER: TER-RM, TER-ELM
  - AUC: AUC-RM
- Experiment
  - Setup: data sets, parameter settings
  - Results: normalization, TER and LAUC results
Introduction
Pattern classification is a widely researched topic in decision making, and empirical learning is one of its major paradigms. Under this paradigm, a classifier is designed to minimize a chosen cost function (the learning criterion). Least Squares Error (LSE) is a commonly used cost function because of its simplicity, clear physical meaning, and tractability for analysis. Embedding nonlinearities into linear models has widened the application of the LSE cost function.
Introduction
Recently, two efficient basis functions were proposed.
- Reduced Multinomial model (RM) [2]. Basis function: a reduced version of the full polynomial.
- Extreme Learning Machine (ELM) [3]. Basis function: single-hidden-layer feedforward neural networks (SLFNs).
However, the limitation of LSE becomes apparent when high accuracy is required. The LSE cost function minimizes the fitting error rather than the classification error, which is the quantity a classification task actually needs to minimize.
Introduction
Three main approaches have been adopted to overcome this drawback of the LSE cost function.
- Discriminant approach: FDA, GDA
- Structural approach: SVM
- Classification-error approach
Within the third approach, two cost functions were recently proposed.
- Total Error Rate (TER)-based approach (TER-RM, TER-ELM) [4,5]: minimize the total error rate in the training stage.
- Area under the ROC curve (AUC)-based approach (AUC-RM) [6]: maximize the area under the ROC curve in the training stage.
The main breakthrough is a smooth approximate formulation of TER and AUC: a quadratic approximation of the counting process leads to a closed-form solution.
Introduction
In this paper, five classification methods based on three different learning criteria were evaluated.
- LSE criterion: RM, ELM
- TER criterion: TER-RM, TER-ELM
- AUC criterion: AUC-RM
Five two-class problems from the UCI database [10] were used for the evaluation: Pima-diabetes, SPECT-heart, StatLog-heart, Tic-tac-toe, and Wdbc.
An efficient way to normalize feature vectors for RM- and ELM-based methods is also discussed.
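LSE-based Method
- Parametric model adopting a basis expansion term:
  $$g(\boldsymbol{\alpha}, \mathbf{x}) = \sum_{k=1}^{K} \alpha_k\, p_k(\mathbf{x}) = \mathbf{p}(\mathbf{x})\,\boldsymbol{\alpha}$$
- LSE cost function:
  $$J(\boldsymbol{\alpha}) = \frac{1}{2}\,\lVert \mathbf{y} - \mathbf{P}\boldsymbol{\alpha} \rVert_2^2 + \frac{b}{2}\,\lVert \boldsymbol{\alpha} \rVert_2^2$$
- Solution for LSE which minimizes J:
  $$\hat{\boldsymbol{\alpha}} = (\mathbf{P}^T\mathbf{P} + b\mathbf{I})^{-1}\,\mathbf{P}^T\mathbf{y}$$
- RM basis function:
  $$\hat{f}(\boldsymbol{\alpha}, \mathbf{x}) = \alpha_0 + \sum_{k=1}^{r}\sum_{j=1}^{l} \alpha_{kj}\, x_j^k + \sum_{j=1}^{r} \alpha_{rl+j}\,(x_1 + x_2 + \cdots + x_l)^j + \sum_{j=2}^{r} (\boldsymbol{\alpha}_j^T \cdot \mathbf{x})\,(x_1 + x_2 + \cdots + x_l)^{j-1}, \quad l, r \ge 2$$
- ELM basis function:
  $$\mathbf{H} = \begin{bmatrix} \varphi(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & \varphi(\mathbf{w}_p \cdot \mathbf{x}_1 + b_p) \\ \vdots & \ddots & \vdots \\ \varphi(\mathbf{w}_1 \cdot \mathbf{x}_m + b_1) & \cdots & \varphi(\mathbf{w}_p \cdot \mathbf{x}_m + b_p) \end{bmatrix}_{m \times p}$$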
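The LSE pipeline on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `rm_basis`, `elm_hidden`, and `lse_fit` are hypothetical, the default value of the regularization constant `b` is arbitrary, and the ELM weights and biases are assumed to be drawn randomly by the caller.

```python
import numpy as np

def rm_basis(X, r):
    """Reduced multinomial expansion of order r for the rows of X (m x l)."""
    m, _ = X.shape
    s = X.sum(axis=1, keepdims=True)                      # x_1 + ... + x_l
    cols = [np.ones((m, 1))]                              # alpha_0 term
    cols += [X ** k for k in range(1, r + 1)]             # x_j^k terms
    cols += [s ** j for j in range(1, r + 1)]             # (x_1+...+x_l)^j terms
    cols += [X * s ** (j - 1) for j in range(2, r + 1)]   # x_j (x_1+...+x_l)^(j-1)
    return np.hstack(cols)

def elm_hidden(X, W, bias):
    """Hidden-layer output matrix H with a sigmoid activation.

    W has shape (l, p) and bias has shape (p,); both are assumed random.
    """
    return 1.0 / (1.0 + np.exp(-(X @ W + bias)))

def lse_fit(P, y, b=1e-4):
    """Regularized least squares: solve (P^T P + b I) alpha = P^T y."""
    K = P.shape[1]
    return np.linalg.solve(P.T @ P + b * np.eye(K), P.T @ y)
```

Either `rm_basis(X, r)` or `elm_hidden(X, W, bias)` can be passed as `P` to `lse_fit`; solving the linear system directly avoids forming the explicit inverse.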
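TER-based Method
- Total Error Rate:
  $$\mathrm{TER}(\boldsymbol{\alpha}, \mathbf{x}^+, \mathbf{x}^-) = \frac{1}{m^-}\sum_{j=1}^{m^-} L\big(g(\boldsymbol{\alpha}, \mathbf{x}_j^-) - \tau\big) + \frac{1}{m^+}\sum_{i=1}^{m^+} L\big(\tau - g(\boldsymbol{\alpha}, \mathbf{x}_i^+)\big)$$
- When using $g(\boldsymbol{\alpha}, \mathbf{x}) = \mathbf{p}(\mathbf{x})\boldsymbol{\alpha}$ and a quadratic approximation:
  $$\mathrm{TER}(\boldsymbol{\alpha}, \mathbf{x}^+, \mathbf{x}^-) = \frac{b}{2}\lVert\boldsymbol{\alpha}\rVert_2^2 + \frac{1}{2m^-}\sum_{j=1}^{m^-}\big[\mathbf{p}(\mathbf{x}_j^-)\boldsymbol{\alpha} - \tau + \eta\big]^2 + \frac{1}{2m^+}\sum_{i=1}^{m^+}\big[\tau - \mathbf{p}(\mathbf{x}_i^+)\boldsymbol{\alpha} + \eta\big]^2$$
- Optimal parameter:
  $$\boldsymbol{\alpha} = \Big[b\mathbf{I} + \frac{1}{m^-}\sum_{j=1}^{m^-}\mathbf{p}_j^T\mathbf{p}_j + \frac{1}{m^+}\sum_{i=1}^{m^+}\mathbf{p}_i^T\mathbf{p}_i\Big]^{-1}\Big[\frac{\tau-\eta}{m^-}\sum_{j=1}^{m^-}\mathbf{p}_j^T + \frac{\tau+\eta}{m^+}\sum_{i=1}^{m^+}\mathbf{p}_i^T\Big]$$
  $$\boldsymbol{\alpha} = \Big[b\mathbf{I} + \frac{1}{m^-}\mathbf{P}_-^T\mathbf{P}_- + \frac{1}{m^+}\mathbf{P}_+^T\mathbf{P}_+\Big]^{-1}\Big[\frac{\tau-\eta}{m^-}\mathbf{P}_-^T\mathbf{1}_- + \frac{\tau+\eta}{m^+}\mathbf{P}_+^T\mathbf{1}_+\Big]$$
  where $\mathbf{p}_j = \mathbf{p}(\mathbf{x}_j^-)$, $\mathbf{p}_i = \mathbf{p}(\mathbf{x}_i^+)$, and $\mathbf{P}_-$, $\mathbf{P}_+$ stack these rows for the negative and positive classes.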
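The closed-form TER solution in matrix form translates almost directly into NumPy. The sketch below is illustrative only: `ter_fit` and `ter_predict` are hypothetical names, the default `b` is an arbitrary assumption, and `P_neg` / `P_pos` are assumed to be the basis-expanded matrices (rows of p(x)) of the negative and positive training samples.

```python
import numpy as np

def ter_fit(P_neg, P_pos, tau=0.5, eta=0.5, b=1e-4):
    """Closed-form minimizer of the quadratic TER approximation."""
    m_neg, K = P_neg.shape
    m_pos = P_pos.shape[0]
    # b I + (1/m-) P-^T P- + (1/m+) P+^T P+
    A = b * np.eye(K) + P_neg.T @ P_neg / m_neg + P_pos.T @ P_pos / m_pos
    # (tau - eta)/m- * P-^T 1  +  (tau + eta)/m+ * P+^T 1
    rhs = ((tau - eta) / m_neg) * P_neg.sum(axis=0) \
        + ((tau + eta) / m_pos) * P_pos.sum(axis=0)
    return np.linalg.solve(A, rhs)

def ter_predict(P, alpha, tau=0.5):
    """Label a sample positive (1) when its score exceeds the threshold tau."""
    return (P @ alpha > tau).astype(int)
```

Because each class term is divided by its own sample count, the two classes contribute equally to the fit even when the data set is imbalanced.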
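AUC-based Method
- Area under the ROC curve:
  $$\mathrm{AUC}(\mathbf{x}^+, \mathbf{x}^-) = \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} \mathbf{1}_{g(\mathbf{x}_i^+) > g(\mathbf{x}_j^-)}$$
- Equivalent minimization of the area above the curve (AAC):
  $$\arg\min_{\boldsymbol{\alpha}} \mathrm{AAC}(\boldsymbol{\alpha}, \mathbf{x}^+, \mathbf{x}^-) = \arg\min_{\boldsymbol{\alpha}} \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} u\big(g(\boldsymbol{\alpha}, \mathbf{x}_j^-) - g(\boldsymbol{\alpha}, \mathbf{x}_i^+)\big)$$
- When using a quadratic approximation:
  $$\arg\min_{\boldsymbol{\alpha}} \mathrm{AAC}(\boldsymbol{\alpha}, \mathbf{x}^+, \mathbf{x}^-) \approx \arg\min_{\boldsymbol{\alpha}} \frac{b}{2}\lVert\boldsymbol{\alpha}\rVert_2^2 + \frac{1}{2 m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-}\big[\big(\mathbf{p}(\mathbf{x}_j^-) - \mathbf{p}(\mathbf{x}_i^+)\big)\boldsymbol{\alpha} + \eta\big]^2$$
- Optimal parameter:
  $$\boldsymbol{\alpha} = \Big[b\mathbf{I} + \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-}(\mathbf{p}_j - \mathbf{p}_i)^T(\mathbf{p}_j - \mathbf{p}_i)\Big]^{-1}\Big[\frac{-\eta}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-}(\mathbf{p}_j - \mathbf{p}_i)^T\Big]$$
- TER-based threshold:
  $$\tau = \frac{1}{2}\Big[\frac{1}{m^-}\sum_{j=1}^{m^-}\mathbf{p}(\mathbf{x}_j^-)\boldsymbol{\alpha} + \frac{1}{m^+}\sum_{i=1}^{m^+}\mathbf{p}(\mathbf{x}_i^+)\boldsymbol{\alpha}\Big]$$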
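A small NumPy sketch of the AUC-based solution, under stated assumptions, may help make the pairwise sums concrete. The names `auc_fit`, `ter_threshold`, and `empirical_auc` are hypothetical, the defaults for `eta` and `b` are arbitrary choices made for illustration, and the threshold is taken as the midpoint of the two class-mean scores, which is what minimizing the quadratic TER approximation over τ with α fixed gives.

```python
import numpy as np

def auc_fit(P_neg, P_pos, eta=1.0, b=1e-4):
    """Closed-form minimizer of the quadratic AAC approximation."""
    K = P_neg.shape[1]
    # All pairwise differences p(x_j^-) - p(x_i^+), shape (m_pos * m_neg, K).
    D = (P_neg[None, :, :] - P_pos[:, None, :]).reshape(-1, K)
    n = D.shape[0]
    A = b * np.eye(K) + D.T @ D / n
    rhs = -(eta / n) * D.sum(axis=0)
    return np.linalg.solve(A, rhs)

def ter_threshold(P_neg, P_pos, alpha):
    """Midpoint of the mean negative score and the mean positive score."""
    return 0.5 * ((P_neg @ alpha).mean() + (P_pos @ alpha).mean())

def empirical_auc(s_neg, s_pos):
    """Fraction of (positive, negative) pairs ranked in the correct order."""
    return np.mean(s_pos[:, None] > s_neg[None, :])
```

The pairwise difference matrix has m⁺·m⁻ rows, so for large data sets it would be preferable to accumulate the two sums in chunks rather than materialize `D` at once.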
Method Description
Basis function    Learning criteria
                  LSE        TER            AUC
RM                RM [2]     TER-RM [4]     AUC-RM [6]
SLFNs             ELM [3]    TER-ELM [5]    -
Data Set Description
DB name                               Number of samples   Number of features   Number of classes   Missing feature values
Pima-diabetes                         768                 8                    2 (65% / 35%)       None
Wisconsin Diagnostic Breast Cancer    569                 30                   2 (63% / 36%)       None
SPECT-heart                           267                 22                   2 (79% / 21%)       None
Statlog-heart                         270                 13                   2 (56% / 44%)       None
Tic-Tac-Toe Endgame                   958                 9                    2 (65% / 35%)       None
Experimental Setup
- Validation: -fold cross validation
- Run: runs for all methods and all settings
- RM, TER-RM, AUC-RM: ~ order
- ELM, TER-ELM: sigmoid activation function, ~ hidden neurons
- TER-RM, TER-ELM: τ = η = 0.5
- AUC-RM: η =
- Data normalization: min-max
  - RM, TER-RM, AUC-RM: data normalization was applied after making the P matrix
  - ELM, TER-ELM: data normalization was applied before making the H matrix
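The repeated cross-validation protocol can be sketched as an index generator. This is an illustration only: `repeated_kfold` is a hypothetical helper, and the fold count `k`, the number of runs, and the seed are left as parameters rather than taken from the setup above.

```python
import numpy as np

def repeated_kfold(m, k, runs, seed=0):
    """Yield (train_idx, test_idx) pairs for `runs` repetitions of k-fold CV."""
    rng = np.random.default_rng(seed)
    for _ in range(runs):
        perm = rng.permutation(m)          # reshuffle the samples for each run
        folds = np.array_split(perm, k)    # k nearly equal folds
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train_idx, test_idx
```

Averaging the test error over all yielded splits gives one number per method and per setting (order or number of hidden neurons).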
Evaluation Criteria
- Total Error Rate (TER):
  $$\mathrm{TER}\,(\%) = \frac{\text{total number of misclassified data samples}}{\text{total number of data samples}} \times 100$$
- LAUC: negative base-10 logarithm of the AUC value, used because the AUC value itself shows little difference between two biometrics that both have high performance:
  $$\mathrm{LAUC} = -\log_{10}(1 - \mathrm{AUC})$$
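Both evaluation criteria are one-liners in NumPy. The sketch below uses hypothetical function names and assumes that LAUC is the negative base-10 logarithm of (1 − AUC), the reading under which near-perfect AUC values are spread apart.

```python
import numpy as np

def total_error_rate(y_true, y_pred):
    """Percentage of misclassified samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true != y_pred)

def lauc(auc):
    """Assumed definition: -log10(1 - AUC); larger means better ranking."""
    return -np.log10(1.0 - auc)
```

Under this assumed definition, AUC values of 0.99 and 0.999 differ by less than 0.01 but map to LAUC values of about 2 and 3, which is why the logarithmic scale is easier to read for high-performing classifiers.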
Normalization Procedure
Min-max normalization was compared under three settings:
- No normalization
- Normalization before making the P or H matrix
- Normalization after making the P or H matrix
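The difference between the two normalization orders is easiest to see in code. In this sketch all names are hypothetical, `expand` stands for whichever basis expansion is used (RM or the ELM hidden layer), and fitting the min-max statistics on the training data only is an assumption not stated on the slide.

```python
import numpy as np

def minmax_fit(A):
    """Column-wise offset and span; constant columns are left unchanged."""
    lo, hi = A.min(axis=0), A.max(axis=0)
    const = hi == lo
    return np.where(const, 0.0, lo), np.where(const, 1.0, hi - lo)

def minmax_apply(A, lo, span):
    return (A - lo) / span

def normalize_before(X_train, X_test, expand):
    """Scale the raw features, then build the P or H matrix."""
    lo, span = minmax_fit(X_train)
    return (expand(minmax_apply(X_train, lo, span)),
            expand(minmax_apply(X_test, lo, span)))

def normalize_after(X_train, X_test, expand):
    """Build the P or H matrix from raw features, then scale its columns."""
    P_train, P_test = expand(X_train), expand(X_test)
    lo, span = minmax_fit(P_train)
    return minmax_apply(P_train, lo, span), minmax_apply(P_test, lo, span)
```

With `normalize_after`, every column of the expanded matrix ends up in [0, 1] on the training set, whereas with `normalize_before` the expanded columns can still differ widely in scale.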
RM case
[Figure: test error (%) versus RM order on wdbc, comparing no normalization, normalization before making the P matrix, and normalization after making the P matrix.]
- Normalization after making the P matrix gives the best performance.
- The P matrix of RM is produced by multiplying and adding many feature values. This leads to a singularity problem in the matrix inversion, which in turn makes the parameter estimation unstable.
- Normalization after making the P matrix is better than normalization before making it. Even if the feature vectors are normalized first, feature values are still multiplied and added when producing the P matrix, which can also cause the singularity problem.
ELM case
[Figure: test error (%) on wdbc for ELM, comparing no normalization, normalization before making the H matrix, and normalization after making the H matrix.]
- Normalization before making the H matrix gives the best performance.
- No normalization and normalization after making the H matrix perform almost the same.
ELM case
[Figure: histograms (number of occurrence versus feature value) at each processing stage for the two normalization orders.]
- Normalization after making the H matrix: input feature, input weight and bias, sigmoid activation function, min-max normalization. Almost no difference after min-max normalization.
- Normalization before making the H matrix: input feature, min-max normalization, input weight and bias, sigmoid activation function. The result is much more informative.
Comparison Results
[Figures: test error (%) and LAUC versus RM order and number of hidden neurons for RM, ELM, TER-RM, TER-ELM, and AUC-RM on Pima-diabetes, SPECT-heart, StatLog-heart, tic-tac-toe, and wdbc.]
Conclusions
- Data normalization
  - Normalization should be applied after making the P matrix when using the RM basis function.
  - Normalization should be applied before making the H matrix when using the ELM basis function.
- Two-class problems
  - All methods give similar results.
  - In particular, TER-RM and AUC-RM have almost the same performance in terms of TER and LAUC.
    - TER: finds the optimal α with a fixed τ to minimize the total error rate.
    - AUC: finds the optimal τ with a fixed α to minimize the total error rate.
  - TER and AUC show very similar trends.
References
[1] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.
[2] K.-A. Toh, Q.-L. Tran, and D. Srinivasan, Benchmarking a reduced multivariate polynomial pattern classifier, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740-755, 2004.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: Theory and applications, Neurocomputing, vol. 70, pp. 489-501, 2006.
[4] K.-A. Toh and H.-L. Eng, Between classification-error approximation and weighted least-squares learning, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 658-669, 2008.
[5] K.-A. Toh, Deterministic Neural Classification, Neural Computation, 2008.
[6] K.-A. Toh, J. Kim, and S. Lee, Maximizing Area Under ROC Curve for Biometric Scores Fusion, Pattern Recognition, 2008.
[7] K.-A. Toh, Learning from Target Knowledge Approximation, Proc. First IEEE Conf. Industrial Electronics and Applications, pp. 815-822, May 2006.
[8] J.A. Hanley and B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, vol. 143, pp. 29-36, 1982.
[9] K.-A. Toh, Between AUC Based and Error Rate Based Learning, The 3rd IEEE Conference on Industrial Electronics and Applications (ICIEA), Singapore, June 2008.
[10] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, UCI Repository of Machine Learning Databases, Univ. of California, Dept. of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/mlrepository.html, 1998.
THE END