Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Contents
Introduction
Three learning criteria: LSE, TER, AUC
Methods based on the three learning criteria
  LSE: RM, ELM
  TER: TER-RM, TER-ELM
  AUC: AUC-RM
Experiment
  Setup: data sets, parameter setting
  Result: normalization, TER and LAUC results

Introduction
Pattern classification is a widely researched topic for decision making. In pattern classification, empirical learning constitutes a major paradigm: a classifier is designed to minimize a certain cost function (learning criterion).
Least Squares Error (LSE) is a commonly used cost function. The reasons for its popularity are its simplicity, its clear physical meaning, and its tractability for analysis.
Embedding nonlinearities into linear models has widened the applicability of the LSE cost function.

Introduction
Recently, two efficient basis functions were proposed:
Reduced Multivariate polynomial Model (RM) [2]: the basis function is a reduced version of the full polynomial.
Extreme Learning Machine (ELM) [3]: the basis function is a Single-hidden Layer Feedforward Neural network (SLFN).
However, LSE's limitation becomes apparent when high accuracy is required: the LSE cost function minimizes the fitting error rather than the classification error, which is the quantity we actually want to minimize in a classification task.

Introduction
Three main approaches have been adopted to overcome this drawback of the LSE cost function:
Discriminant approach: FDA, GDA
Structural approach: SVM
Classification-error approach
In the third approach, two cost functions were recently proposed:
Total Error Rate (TER)-based approach (TER-RM, TER-ELM) [4,5]: minimize the total error rate in the training stage.
Area under the ROC curve (AUC)-based approach (AUC-RM) [6]: maximize the area under the ROC curve in the training stage.
The main breakthrough is a smooth approximate formulation for calculating TER and AUC: a quadratic approximation of the counting process that yields a closed-form solution.

Introduction
In this paper, five classification methods based on three different learning criteria were evaluated:
LSE criterion: RM, ELM
TER criterion: TER-RM, TER-ELM
AUC criterion: AUC-RM
Five two-class problems from the UCI database were used for the evaluation: Pima-diabetes, SPECT-heart, StatLog-heart, Tic-tac-toe, and Wdbc.
An efficient way to normalize feature vectors for the RM- and ELM-based methods is also discussed.

LSE-based Method
Parametric model adopting a basis expansion term:
g(\alpha, x) = \sum_{k=1}^{K} \alpha_k p_k(x) = p(x)^T \alpha
LSE cost function:
J(\alpha) = \|y - P\alpha\|^2 + b\|\alpha\|^2
Solution for LSE which minimizes J:
\hat{\alpha} = (P^T P + b I)^{-1} P^T y
RM basis function (reduced multivariate polynomial of order r over l inputs):
\hat{f}(\alpha, x) = \alpha_0 + \sum_{k=1}^{r}\sum_{j=1}^{l} \alpha_{kj} x_j^k + \sum_{j=1}^{r} \alpha_{rl+j} (x_1 + x_2 + \dots + x_l)^j + \sum_{j=2}^{r} (\alpha_j^T x)(x_1 + x_2 + \dots + x_l)^{j-1}, \quad l, r \ge 2
ELM basis function (hidden-layer output matrix of an SLFN with m samples and p hidden neurons):
H = \begin{bmatrix} \phi(w_1 \cdot x_1 + b_1) & \cdots & \phi(w_p \cdot x_1 + b_p) \\ \vdots & \ddots & \vdots \\ \phi(w_1 \cdot x_m + b_1) & \cdots & \phi(w_p \cdot x_m + b_p) \end{bmatrix}
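As a rough illustration of this closed-form solution, here is a small numpy sketch that substitutes a toy element-wise polynomial expansion for the actual RM/ELM basis functions; all names and the regularization value b are illustrative, not the authors' implementation.

```python
import numpy as np

def poly_basis(X, order=2):
    # Toy basis expansion: a bias column plus element-wise powers of each feature.
    # This only stands in for the RM / ELM basis functions discussed in the slides.
    cols = [np.ones((X.shape[0], 1))]
    for k in range(1, order + 1):
        cols.append(X ** k)
    return np.hstack(cols)                      # the P matrix, one row per sample

def lse_train(X, y, b=1e-3, order=2):
    # Closed-form regularized LSE: alpha = (P^T P + b I)^{-1} P^T y
    P = poly_basis(X, order)
    return np.linalg.solve(P.T @ P + b * np.eye(P.shape[1]), P.T @ y)

def lse_predict(X, alpha, order=2):
    return poly_basis(X, order) @ alpha         # g(alpha, x) = p(x)^T alpha

# Usage: two-class labels in {0, 1}, decision threshold at 0.5.
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2.0).astype(float)
alpha = lse_train(X, y)
pred = (lse_predict(X, alpha) > 0.5).astype(float)
print("training error (%):", 100.0 * np.mean(pred != y))
```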

TER-based Method
Total Error Rate (m^- negative and m^+ positive training samples, decision threshold \tau, counting step function L):
TER(\alpha, x^-, x^+) = \frac{1}{m^-} \sum_{j=1}^{m^-} L\big(g(\alpha, x_j^-) - \tau\big) + \frac{1}{m^+} \sum_{i=1}^{m^+} L\big(\tau - g(\alpha, x_i^+)\big)
When using g(\alpha, x) = p(x)^T \alpha and a quadratic approximation of the counting step:
TER(\alpha, x^-, x^+) \approx \frac{b}{2}\|\alpha\|^2 + \frac{1}{2m^-} \sum_{j=1}^{m^-} \big(p(x_j^-)^T \alpha - (\tau - \eta)\big)^2 + \frac{1}{2m^+} \sum_{i=1}^{m^+} \big(p(x_i^+)^T \alpha - (\tau + \eta)\big)^2
Optimal parameter:
\alpha = \Big(b I + \frac{1}{m^-} \sum_{j} p_j^- {p_j^-}^T + \frac{1}{m^+} \sum_{i} p_i^+ {p_i^+}^T\Big)^{-1} \Big(\frac{\tau - \eta}{m^-} \sum_{j} p_j^- + \frac{\tau + \eta}{m^+} \sum_{i} p_i^+\Big)
or, in matrix form,
\alpha = \Big(b I + \frac{1}{m^-} {P^-}^T P^- + \frac{1}{m^+} {P^+}^T P^+\Big)^{-1} \Big(\frac{\tau - \eta}{m^-} {P^-}^T \mathbf{1} + \frac{\tau + \eta}{m^+} {P^+}^T \mathbf{1}\Big)
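A corresponding numpy sketch of this TER closed-form solution, operating on pre-built basis matrices for the negative and positive classes; the tau, eta, and b values are illustrative, and the usage lines reuse poly_basis, X, and y from the LSE sketch above.

```python
import numpy as np

def ter_train(P_neg, P_pos, b=1e-3, tau=0.5, eta=0.5):
    # Closed-form TER solution: weighted regularized least squares with
    # target (tau - eta) for negative rows and (tau + eta) for positive rows.
    m_neg, m_pos = P_neg.shape[0], P_pos.shape[0]
    K = P_neg.shape[1]
    A = b * np.eye(K) + P_neg.T @ P_neg / m_neg + P_pos.T @ P_pos / m_pos
    rhs = (tau - eta) / m_neg * P_neg.sum(axis=0) + (tau + eta) / m_pos * P_pos.sum(axis=0)
    return np.linalg.solve(A, rhs)

# Usage with the toy poly_basis from the LSE sketch:
P = poly_basis(X, order=2)
alpha = ter_train(P[y == 0], P[y == 1])
pred = (P @ alpha > 0.5).astype(float)          # classify against the threshold tau
```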

AUC-based Method
Area under the ROC curve (empirical, over m^+ positive and m^- negative samples):
AUC(x^-, x^+) = \frac{1}{m^- m^+} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} \mathbf{1}\big[g(x_i^+) > g(x_j^-)\big]
Maximizing the AUC is equivalent to minimizing the fraction of wrongly ordered pairs (u is the unit step function):
\arg\min_{\alpha} \frac{1}{m^- m^+} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} u\big(g(\alpha, x_j^-) - g(\alpha, x_i^+)\big)
When using a quadratic approximation:
\arg\min_{\alpha} \frac{b}{2}\|\alpha\|^2 + \frac{1}{2 m^- m^+} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} \big((p(x_j^-) - p(x_i^+))^T \alpha + \eta\big)^2
Optimal parameter:
\alpha = \Big(b I + \frac{1}{m^- m^+} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} (p_j^- - p_i^+)(p_j^- - p_i^+)^T\Big)^{-1} \frac{\eta}{m^- m^+} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} (p_i^+ - p_j^-)
TER-based threshold:
\tau = \frac{1}{2 m^-} \sum_{j=1}^{m^-} p(x_j^-)^T \alpha + \frac{1}{2 m^+} \sum_{i=1}^{m^+} p(x_i^+)^T \alpha
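A numpy sketch of this pairwise AUC solution together with its TER-based threshold, again operating on pre-built basis matrices; eta and b are illustrative, and the usage lines reuse P and y from the TER sketch above.

```python
import numpy as np

def auc_train(P_neg, P_pos, b=1e-3, eta=1.0):
    # Closed-form solution of the quadratic approximation to the pairwise
    # ordering errors, followed by the TER-based threshold.
    m_neg, m_pos = P_neg.shape[0], P_pos.shape[0]
    K = P_neg.shape[1]
    D = (P_neg[None, :, :] - P_pos[:, None, :]).reshape(-1, K)     # p_j^- - p_i^+ for all pairs
    A = b * np.eye(K) + D.T @ D / (m_neg * m_pos)
    rhs = eta * (-D).sum(axis=0) / (m_neg * m_pos)                 # sum over pairs of p_i^+ - p_j^-
    alpha = np.linalg.solve(A, rhs)
    tau = 0.5 * ((P_neg @ alpha).mean() + (P_pos @ alpha).mean())  # TER-based threshold
    return alpha, tau

# Usage with P and y from the TER sketch:
alpha, tau = auc_train(P[y == 0], P[y == 1])
pred = (P @ alpha > tau).astype(float)
```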

Method Description
Basis function | LSE criterion | TER criterion | AUC criterion
RM | RM [2] | TER-RM [4] | AUC-RM [6]
SLFNs | ELM [3] | TER-ELM [5] | -

Data Set Description
DB name | Number of samples | Number of features | Number of classes | Missing feature values
Pima-diabetes | 768 | 8 | 2 (65% / 35%) | None
Wisconsin Diagnostic Breast Cancer (Wdbc) | 569 | 30 | 2 (63% / 37%) | None
SPECT-heart | 267 | 22 | 2 (79% / 21%) | None
StatLog-heart | 270 | 13 | 2 (56% / 44%) | None
Tic-Tac-Toe Endgame | 958 | 9 | 2 (65% / 35%) | None

Experimental Setup
Validation: -fold cross validation
Runs: runs for every method and every setting
RM, TER-RM, AUC-RM: polynomial order ~
ELM, TER-ELM: sigmoid activation function, ~ hidden neurons
TER-RM, TER-ELM: τ = η = 0.5
AUC-RM: η =
Data normalization: min-max
RM-basis methods: data normalization applied after making the P matrix
ELM-basis methods (ELM, TER-ELM): data normalization applied before making the H matrix
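Purely as an illustration of this evaluation protocol, here is a rough numpy sketch of a k-fold cross-validation loop around the toy LSE routines from the earlier sketch; the fold count, parameters, and error metric are placeholders, not the settings actually used in the slides.

```python
import numpy as np

def cross_validate(X, y, n_folds=10, order=2, b=1e-3, seed=0):
    # k-fold cross validation of the toy LSE classifier, reporting mean test error (%).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.hstack([folds[j] for j in range(n_folds) if j != k])
        alpha = lse_train(X[train_idx], y[train_idx], b=b, order=order)
        pred = (lse_predict(X[test_idx], alpha, order=order) > 0.5).astype(float)
        errors.append(100.0 * np.mean(pred != y[test_idx]))
    return float(np.mean(errors))

print("mean test error (%):", cross_validate(X, y))
```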

Evaluation Criteria
Total Error Rate (TER):
TER(%) = (total number of misclassified data samples / total number of data samples) × 100
LAUC: negative base-10 logarithm of (1 − AUC), used because raw AUC values show little difference between two classifiers that both perform well:
LAUC = -\log_{10}(1 - AUC)
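A small numpy sketch of these two evaluation measures; the helper names are mine, and AUC is computed directly from its pair-counting definition.

```python
import numpy as np

def ter_percent(y_true, y_pred):
    # TER (%) = misclassified samples / total samples * 100
    return 100.0 * np.mean(y_true != y_pred)

def lauc(scores_neg, scores_pos):
    # AUC as the fraction of correctly ordered (positive, negative) score pairs,
    # then LAUC = -log10(1 - AUC); a tiny epsilon avoids log10(0) at AUC = 1.
    auc = np.mean(scores_pos[:, None] > scores_neg[None, :])
    return -np.log10(1.0 - auc + 1e-12)
```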

Normalization Procedure
The min-max normalization technique was applied in three different ways:
No normalization
Normalization before making the P or H matrix
Normalization after making the P or H matrix
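A minimal numpy illustration of the last two placements, reusing the toy poly_basis from the LSE sketch; the column-wise min-max helper is an assumption about the exact normalization details.

```python
import numpy as np

def min_max(M):
    # Column-wise min-max normalization to [0, 1]; constant columns (e.g. the
    # bias column of P) are left untouched to avoid dividing by zero.
    lo, hi = M.min(axis=0), M.max(axis=0)
    rng = hi - lo
    return np.where(rng > 0, (M - lo) / np.where(rng > 0, rng, 1.0), M)

def basis_normalized_before(X, order=2):
    return poly_basis(min_max(X), order)        # normalize the features, then expand

def basis_normalized_after(X, order=2):
    return min_max(poly_basis(X, order))        # expand, then normalize the P matrix columns
```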

RM case
[Plot: test error (%) vs. polynomial order on wdbc for three settings: no normalization, normalization before making the P matrix, normalization after making the P matrix.]
Normalization after making the P matrix gives the best performance.
The P matrix of RM is produced by multiplying and adding many feature values. This leads to a singularity problem in the matrix inversion, which in turn makes the parameter estimation unstable.
Normalization after making the P matrix is better than normalization before making the P matrix: even if the feature vectors are normalized before the P matrix is formed, feature values are still multiplied and added when producing the P matrix, so the singularity problem can occur anyway.

ELM case
[Plot: test error (%) on wdbc for three settings: no normalization, normalization before making the H matrix, normalization after making the H matrix.]
Normalization before making the H matrix gives the best performance.
No normalization and normalization after making the H matrix have almost the same performance.

ELM case
[Histograms of the values at each stage of the H matrix computation (input feature, input weight and bias, sigmoid activation function output, min-max normalization) under the two settings: with normalization after making the H matrix, the min-max step leaves the sigmoid outputs almost unchanged (almost no difference), whereas normalization before making the H matrix yields much more informative sigmoid outputs.]

Comparison Results
[Plots: test error (%) and LAUC versus model order / number of hidden neurons for RM, ELM, TER-RM, TER-ELM, and AUC-RM on Pima-diabetes and SPECT-heart.]

Comparison Results
[Plots: test error (%) and LAUC versus model order / number of hidden neurons for RM, ELM, TER-RM, TER-ELM, and AUC-RM on StatLog-heart and tic-tac-toe.]

Comparison Results
[Plots: test error (%) and LAUC versus model order / number of hidden neurons for RM, ELM, TER-RM, TER-ELM, and AUC-RM on wdbc.]

Conclusions
For data normalization:
Normalization should be applied after making the P matrix when using the RM basis function.
Normalization should be applied before making the H matrix when using the ELM basis function.
For two-class problems:
All methods give similar results. In particular, TER-RM and AUC-RM have almost the same performance in terms of TER and LAUC.
TER: find the optimal α with a fixed τ to minimize the total error rate.
AUC: find the optimal τ with a fixed α to minimize the total error rate.
TER and AUC therefore show a very similar trend.

References
[1] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, 2001.
[2] K.-A. Toh, Q.-L. Tran, and D. Srinivasan, "Benchmarking a reduced multivariate polynomial pattern classifier," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740-755, 2004.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, pp. 489-501, 2006.
[4] K.-A. Toh and H.-L. Eng, "Between classification-error approximation and weighted least-squares learning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 658-669, 2008.
[5] K.-A. Toh, "Deterministic neural classification," Neural Computation, 2008.
[6] K.-A. Toh, J. Kim, and S. Lee, "Maximizing area under ROC curve for biometric scores fusion," Pattern Recognition, 2008.
[7] K.-A. Toh, "Learning from target knowledge approximation," Proc. First IEEE Conf. Industrial Electronics and Applications, pp. 85-8, May 2006.
[8] J.A. Hanley and B.J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, pp. 29-36, 1982.
[9] K.-A. Toh, "Between AUC based and error rate based learning," The 3rd IEEE Conference on Industrial Electronics and Applications (ICIEA), Singapore, June 2008.
[10] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, UCI Repository of Machine Learning Databases, Univ. of California, Dept. of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/mlrepository.html, 1998.

THE END