A Hybrid Neuron with Gradient-based Learning for Binary Classification Problems


A Hybrid Neuron with Gradient-based Learning for Binary Classification Problems

Ricardo de A. Araújo 1,2, Adriano L. I. Oliveira 1, Silvio R. L. Meira 1

1 Informatics Center, Federal University of Pernambuco, Recife, PE, Brazil
2 Informatics Department, Federal Institute of Sertão Pernambucano, Ouricuri, PE, Brazil

ricardo.araujo@ifsertao-pe.edu.br or {raa,alio,srlm}@cin.ufpe.br

Abstract. In this paper we present a hybrid neuron based on principles of mathematical morphology and lattice theory to solve binary classification problems. For the model design we present a gradient-based method that uses ideas from the back propagation algorithm together with a systematic approach to overcome the problem of nondifferentiability of morphological operations. Furthermore, we conduct an experimental analysis using two relevant binary classification problems, and the obtained results are discussed and compared with those obtained by established techniques in the literature.

1. Introduction

The perceptron is the best-known artificial neuron proposed in the literature [Haykin 1998, Haykin 2007]. It was inspired by the concept of biological neurons and it is able to solve linear classification problems [Haykin 1998, Haykin 2007]. A particular class of artificial neurons based on the framework of mathematical morphology (MM) [Maragos 1989] under the context of lattice theory [Heijmans 1994], called morphological perceptrons (MPs) [Ritter and Urcid 2003, Sussner and Esmi 2011], has been successfully applied to linear and nonlinear problems [Ritter et al. 1997, Ritter et al. 1998, Sussner 1998a, Petridis and Kaburlasos 1998, Kaburlasos and Petridis 2000, Khabou and Gader 2000, Hocaoglu and Gader 2003, Sussner and Valle 2006a, Sussner and Valle 2006b, de A. Araújo et al. 2006b, de A. Araújo et al. 2006c, de A. Araújo et al. 2006a, Sussner and Valle 2007, Valle and Sussner 2008, Silva and Sussner 2008, Sussner and Esmi 2009b, Sussner and Esmi 2009a, Sussner and Esmi 2011].

In this context, this work proposes a hybrid artificial neuron, which can be seen as a particular class of morphological-linear perceptrons, for dealing with binary classification problems. The proposed model, called the dilation-erosion-linear perceptron (DELP), consists of a linear combination of nonlinear morphological operators under the context of lattice theory and a linear operator. Also, a gradient-based method is presented to design the proposed DELP (learning process), based on ideas from the back propagation (BP) algorithm [Haykin 1998, Haykin 2007] and using a systematic approach to overcome the problem of nondifferentiability of morphological operations, based on ideas from Pessoa and Maragos [Pessoa and Maragos 2000] and Sousa [de Sousa et al. 2000].

Furthermore, an experimental analysis is conducted with the proposed model using the Ripley's Synthetic [Ripley 1996] and the Wisconsin Breast Cancer [Asuncion and Newman 2007] classification problems. The achieved results are compared with those obtained by established techniques in the literature, where it is possible to notice that the DELP model can be used as an accurate binary classifier.

This work is organized as follows. Section 2 describes the proposed DELP model. Section 3 presents simulations and experimental results with the proposed DELP model, as well as a comparison between the obtained results and those given by established techniques presented in the literature. Finally, Section 4 presents the final remarks of this work.

2. The Dilation-Erosion-Linear Perceptron

The proposed dilation-erosion-linear perceptron (DELP) consists of a linear combination of a nonlinear operator (dilation and erosion operators) and a linear operator (finite impulse response). Next we present the definition, the fundamentals and the proposed training algorithm to design the DELP.

Let x = (x_1, x_2, ..., x_n) ∈ R^n be a real-valued input signal inside an n-point moving window and let y be the output of the DELP. Then, the DELP is defined by a hybrid morphological-linear system with local signal transformation rule x → y, given by

y = λα + (1 − λ)β,  λ ∈ [0, 1],   (1)

where

β = x p^T = x_1 p_1 + x_2 p_2 + ... + x_n p_n,   (2)

α = θφ + (1 − θ)ω,  θ ∈ [0, 1],   (3)

in which

φ = δ_a(x) = ⋁_{i=1}^{n} (x_i + a_i),   (4)

ω = ε_b(x) = ⋀_{i=1}^{n} (x_i + b_i),   (5)

where the term n denotes the dimensionality of the input signal (x), and the terms λ, θ ∈ R and a, b, p ∈ R^n. The vector p ∈ R^n represents the coefficients (weights) of the linear operator. The term β represents the output of the linear operator. The term α represents the convex combination of the morphological operators of dilation and erosion (the mixture term is defined by θ). The terms φ and ω represent the outputs of the morphological operators of dilation and erosion, respectively. The vectors a and b represent the structuring elements (weights) of the dilation (δ_a(x)) and erosion (ε_b(x)) operators employed in the nonlinear module of the DELP. The terms ⋁ and ⋀ represent the supremum and the infimum operations.

Note that the output y is given by a convex combination of the linear operator and another convex combination of the morphological operators of dilation and erosion (the mixture term is defined by λ).
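To make the transformation rule of Equations 1-5 concrete, the following is a minimal NumPy sketch (the function delp_output and its argument names are ours, not part of the original formulation); it assumes finite real-valued inputs, so ordinary addition can stand in for the extended additions discussed next.

```python
import numpy as np

def delp_output(x, a, b, p, lam, theta):
    """Compute the DELP output y for a single input pattern x (Equations 1-5)."""
    beta = np.dot(x, p)                          # linear (FIR) term, Eq. (2)
    phi = np.max(x + a)                          # dilation: supremum of x_i + a_i, Eq. (4)
    omega = np.min(x + b)                        # erosion: infimum of x_i + b_i, Eq. (5)
    alpha = theta * phi + (1.0 - theta) * omega  # convex mix of dilation and erosion, Eq. (3)
    return lam * alpha + (1.0 - lam) * beta      # convex mix with the linear term, Eq. (1)

# Example with a 3-point window and arbitrary weights:
x = np.array([0.2, 0.7, 0.4])
a = np.array([0.1, -0.3, 0.05])
b = np.array([-0.2, 0.1, 0.0])
p = np.array([0.5, -0.1, 0.3])
y = delp_output(x, a, b, p, lam=0.6, theta=0.4)
```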

The main differences between the additions employed in the dilation (+) and in the erosion (+') are given by the following rules for the extended values:

(−∞) + (+∞) = (+∞) + (−∞) = −∞,   (6)

(−∞) +' (+∞) = (+∞) +' (−∞) = +∞.   (7)

2.1. Learning Process

The design of the DELP model requires the adjustment of the parameters a, b, p ∈ R^n and λ, θ ∈ R. Therefore, the weight vector w (note that w ∈ R^{3n+2}) of the DELP model is given by

w = (a, b, p, λ, θ).   (8)

During the proposed learning process, all parameters of the DELP model are iteratively adjusted according to an error criterion until convergence. Therefore, it is necessary to define an objective function J(w) (representing the error between the target and the model output using the weight vector w) to be minimized during the learning process, given by

J(w) = Σ_{m=1}^{M} e^2(m),   (9)

in which M represents the number of input patterns in the learning process and e(m) represents the instantaneous error for the m-th input pattern, given by

e(m) = t(m) − y(m),   (10)

where t(m) and y(m) are the target and the model output, respectively.

Note that the objective function builds an error surface within the space R^{3n+2}. The main problem in minimizing J(w) is to find the point in this space which minimizes the error between the target and the model output, that is, to determine w* = argmin_w J(w). In this work we propose a gradient steepest descent method using ideas of the back propagation algorithm [Haykin 1998, Haykin 2007], which is used to obtain the gradient vector to adjust the weight vector of the DELP.

The learning process of the DELP model updates the weight vector w based on the steepest descent method. The adjustment of the vector w for the m-th input training pattern is given by the following iterative formula:

w(i + 1) = w(i) − µ∇J(w),   (11)

where µ > 0 (usually called step size or learning rate) and i ∈ {1, 2, ...}. The term µ is responsible for regulating the tradeoff between stability and speed of convergence of the iterative procedure. The iteration of Equation 11 starts with an initial guess w(0) and stops when some desired condition is reached.
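As a rough sketch of how the update rule of Equation 11 can be organized in practice, the pattern-by-pattern loop below uses the delp_output function from the previous sketch and a grad_y function (sketched after Equation 31) that returns the partial derivatives of the output; the dictionary packing of w and the stopping logic are our own illustrative choices, not a prescription from the paper.

```python
import numpy as np

def train_delp(patterns, targets, w, mu=0.01, max_epochs=10_000, tol=1e-6):
    """Pattern-wise steepest descent on J(w), Equations (9)-(11).

    `w` is a dict with keys 'a', 'b', 'p', 'lam' and 'theta'; grad_y(x, w)
    must return the partial derivatives of y with respect to each key.
    The convexity constraints lam, theta in [0, 1] are not enforced here.
    """
    prev_cost = np.inf
    for epoch in range(max_epochs):
        cost = 0.0
        for x, t in zip(patterns, targets):
            y = delp_output(x, w['a'], w['b'], w['p'], w['lam'], w['theta'])
            e = t - y                                        # instantaneous error, Eq. (10)
            g = grad_y(x, w)                                 # partials of y, Eqs. (18)-(31)
            for key in w:                                    # Eq. (11): w <- w - mu * dJ/dw,
                w[key] = w[key] - mu * (-2.0 * e * g[key])   # with dJ/dw = -2 e(m) dy/dw
            cost += e ** 2                                   # accumulate J(w), Eq. (9)
        if abs(prev_cost - cost) < tol:                      # stop on a small cost decrease
            break                                            # (or when max_epochs is reached)
        prev_cost = cost
    return w
```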

The term ∇J(w) is the gradient, which is given by

∇J(w) = ∂J/∂w = (∂J/∂a, ∂J/∂b, ∂J/∂p, ∂J/∂λ, ∂J/∂θ),   (12)

in which

∂J/∂a = −2e(m) ∂y/∂a,   (13)

∂J/∂b = −2e(m) ∂y/∂b,   (14)

∂J/∂p = −2e(m) ∂y/∂p,   (15)

∂J/∂λ = −2e(m) ∂y/∂λ,   (16)

∂J/∂θ = −2e(m) ∂y/∂θ.   (17)

Note that the existence of the gradient of J with respect to w depends on the existence of the gradients ∂y/∂a, ∂y/∂b, ∂y/∂p, ∂y/∂λ and ∂y/∂θ. Next we present the formulas to calculate them.

The term ∂y/∂λ is given by

∂y/∂λ = α − β.   (18)

The term ∂y/∂p is given by

∂y/∂p = (∂y/∂β)(∂β/∂p),   (19)

in which

∂y/∂β = 1 − λ,   (20)

∂β/∂p = x,   (21)

where x represents the input signal (the m-th input training pattern).

The term ∂y/∂θ is given by

∂y/∂θ = (∂y/∂α)(∂α/∂θ),   (22)

in which

∂y/∂α = λ,   (23)

∂α/∂θ = φ − ω.   (24)

The terms ∂φ/∂a and ∂ω/∂b are estimated using the concept of the smoothed rank indicator vector [Pessoa and Maragos 2000, Sousa 2000, de Sousa et al. 2000] (because the dilation and erosion operators can be seen as particular cases of the rank function), where we choose the smoothed unit sample function Q_σ(x) = [q_σ(x_1), q_σ(x_2), ..., q_σ(x_n)], in which

q_σ(x_i) = sech^2(x_i / σ),  i = 1, ..., n.   (25)

Note that the choice of the scale factor σ directly affects the estimation and interpolation of the gradients ∂φ/∂a and ∂ω/∂b. However, the learning process of the DELP model even works with σ → 0, since in this particular case the gradient will be given in terms of the usual rank indicator vector [Pessoa and Maragos 2000, Sousa 2000, de Sousa et al. 2000].

Therefore, the term ∂y/∂a is given by

∂y/∂a = (∂y/∂φ)(∂φ/∂a) = λ(∂α/∂φ)(∂φ/∂a),   (26)

in which

∂α/∂φ = θ,   (27)

∂φ/∂a = Q_σ(φ1 − (x + a)) / [Q_σ(φ1 − (x + a)) 1^T].   (28)

In the same way, the term ∂y/∂b is given by

∂y/∂b = (∂y/∂ω)(∂ω/∂b) = λ(∂α/∂ω)(∂ω/∂b),   (29)

in which

∂α/∂ω = 1 − θ,   (30)

∂ω/∂b = Q_σ(ω1 − (x + b)) / [Q_σ(ω1 − (x + b)) 1^T],   (31)

where 1 denotes an n-dimensional vector of ones.
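Continuing the same illustrative naming, a possible NumPy realization of Equations 18-31 is sketched below; q_sigma implements the smoothed unit sample function of Equation 25, and the structuring-element gradients use the normalized smoothed rank indicator of Equations 28 and 31.

```python
import numpy as np

def q_sigma(v, sigma):
    """Smoothed unit sample function applied element-wise, Eq. (25): sech^2(v / sigma)."""
    return 1.0 / np.cosh(v / sigma) ** 2

def grad_y(x, w, sigma=1.5):
    """Partial derivatives of the DELP output y w.r.t. each parameter, Eqs. (18)-(31)."""
    a, b, p = w['a'], w['b'], w['p']
    lam, theta = w['lam'], w['theta']
    beta = np.dot(x, p)
    phi = np.max(x + a)                          # dilation output, Eq. (4)
    omega = np.min(x + b)                        # erosion output, Eq. (5)
    alpha = theta * phi + (1.0 - theta) * omega  # Eq. (3)

    qa = q_sigma(phi - (x + a), sigma)           # smoothed rank indicator, Eq. (28)
    dphi_da = qa / np.sum(qa)
    qb = q_sigma(omega - (x + b), sigma)         # smoothed rank indicator, Eq. (31)
    domega_db = qb / np.sum(qb)

    return {
        'lam': alpha - beta,                     # Eq. (18)
        'p': (1.0 - lam) * x,                    # Eqs. (19)-(21)
        'theta': lam * (phi - omega),            # Eqs. (22)-(24)
        'a': lam * theta * dphi_da,              # Eqs. (26)-(28)
        'b': lam * (1.0 - theta) * domega_db,    # Eqs. (29)-(31)
    }
```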

3. Simulations and Experimental Results

The well-known Ripley's synthetic and Wisconsin breast cancer classification problems were used as a test bed for the evaluation of the proposed model. To assess the classification performance we use the percentage of misclassified patterns (PMP) [Sussner and Esmi 2011] metric. Also, we use the percentage gain (PG) metric, in terms of the PMP obtained using the DELP model and using the other models investigated in this work, which is given by

PG = (1 − PMP_delp / PMP_model) × 100,   (32)

where PMP_delp represents the PMP obtained using the proposed DELP model and PMP_model represents the PMP obtained using the investigated model.

It is worth mentioning that the data were normalized to lie within the range [0, 1] according to Prechelt [Prechelt 1994]. The entries of the DELP weight vectors a, b and p are randomly initialized within the range [−1, 1]. The initial DELP mixture coefficients λ and θ are randomly chosen in the interval [0, 1]. Based on exhaustive experiments to determine the best learning rate (µ) and scale factor (σ), we use µ = 0.01 and σ = 1.5. It is worth mentioning that the following stop conditions are used in the learning process: i) the maximum epoch number equals 10^4; ii) the decrease of the training process error (Pt) of the cost function equals 10^-6.
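A small sketch of this experimental setup, assuming plain min-max scaling for the [0, 1] normalization (the exact Prechelt procedure is not reproduced here) and NumPy's uniform sampler for the random initialization; the helper names are ours:

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each feature of the data matrix X to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def init_delp_weights(n, seed=0):
    """Random initialization of the DELP weight vector w = (a, b, p, lam, theta)."""
    rng = np.random.default_rng(seed)
    return {
        'a': rng.uniform(-1.0, 1.0, n),   # structuring element of the dilation
        'b': rng.uniform(-1.0, 1.0, n),   # structuring element of the erosion
        'p': rng.uniform(-1.0, 1.0, n),   # coefficients of the linear operator
        'lam': rng.uniform(0.0, 1.0),     # mixture between morphological and linear parts
        'theta': rng.uniform(0.0, 1.0),   # mixture between dilation and erosion
    }
```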

In order to establish a fair performance comparison, results with the following classification models were examined in the same context and under the same experimental conditions: multilayer perceptron (MLP) [Haykin 1998, Haykin 2007], morphological-rank-linear neural network (MRLNN) [Pessoa and Maragos 2000], morphological perceptron with competitive learning (MP/CL) [Sussner and Esmi 2011], single layer morphological perceptron (SLMP) [Sussner 1998b], fuzzy lattice neural network (FLNN) [Petridis and Kaburlasos 1998], fuzzy lattice reasoning (FLR) [Kaburlasos et al. 2007], k-nearest neighbors (KNN) [Devroye et al. 1996], decision tree (DT) [Breiman et al. 1984, Esposito et al. 1997] and support vector machine (SVM) [Haykin 2007].

In all experiments we used the MLP model with sigmoidal processing units and a single hidden layer. For its learning process we used the Levenberg-Marquardt algorithm [Hagan and Menhaj 1994] with the following stopping criteria [Prechelt 1994]: i) the maximum epoch number equals 10^4; ii) the decrease of the training process error (Pt) of the cost function falling below a fixed threshold. Also, for the MRLNN model we used the same parameters suggested by [Pessoa and Maragos 2000] with a single hidden layer; for its learning process we used the generalized back propagation (GBP) algorithm [Pessoa and Maragos 2000] with a learning rate equal to 0.01 and a fixed scale factor, using the same stopping criteria as the MLP model. It is worth mentioning that for both the MLP and MRLNN models, we applied 10-fold cross validation to determine the number of hidden processing units (5, 10, 15, 20, 25 or 50).

For the MP/CL model we used the same design process and parameter definitions suggested by [Sussner and Esmi 2011]. For the SLMP model we used the same design process and parameter definitions suggested by [Sussner 1998b]. For the FLNN model we used the same design process and parameter definitions suggested by [Petridis and Kaburlasos 1998, Sussner and Esmi 2011]. For the FLR model we used the same design process and parameter definitions suggested by [Kaburlasos et al. 2007, Sussner and Esmi 2011]. For the KNN model we used 10-fold cross-validation to determine the best value of k (1, 2, ..., 20) in terms of the mean error rate on the validation set, as suggested by [Sussner and Esmi 2011]. For the DT model we used Gini's diversity index as the criterion for choosing a split, as suggested by [Sussner and Esmi 2011]. Finally, for the SVM model we used linear (SVM-L), polynomial (SVM-P), quadratic (SVM-Q) and RBF (SVM-RBF) kernels with the least squares method to find the separating hyperplane, as defined in [Haykin 2007].

3.1. Ripley's Synthetic Problem

The Ripley's synthetic problem [Ripley 1996] consists of samples from two classes. Each sample has a 2-dimensional feature vector. The data are divided into training and test sets. The training set consists of 250 samples, while the test set consists of 1000 samples. It is worth mentioning that, for both the training and test sets, we have the same number of samples belonging to each of the two classes, characterizing a balanced binary classification problem in R^2.

Table 1 presents the experimental results on the test set obtained by the models presented in the literature, as well as those achieved by the proposed DELP model. According to Table 1, it is possible to notice that the best model found in the literature is the SVM-RBF (with PMP = 8.30%). However, a slightly inferior classification performance can be achieved using the SVM-P, MLP, SVM-Q, MRLNN, KNN and MP/CL models.
It is worth mentioning that the proposed DELP model obtained a good classification performance, having the same PMP value as the best model found in the literature.

Table 1. Percentage of misclassified patterns of the test set for the Ripley's synthetic problem.

Model     PMP (%)
MLP       9.30
MRLNN     9.50
MP/CL
SLMP
FLNN
FLR
KNN       9.60
DT
SVM-L
SVM-P     9.10
SVM-Q     9.40
SVM-RBF   8.30
DELP      8.30

Table 2 presents the PG (test set), in terms of the PMP obtained using the DELP model and using the other models investigated in this work.

Table 2. Percentage gain (test set) of the proposed DELP model regarding the MLP, MRLNN, MP/CL, SLMP, FLNN, FLR, KNN, DT, SVM-L, SVM-P, SVM-Q and SVM-RBF models for the Ripley's synthetic problem.

                  PG (%)
DELP / MLP
DELP / MRLNN
DELP / MP/CL
DELP / SLMP
DELP / FLNN
DELP / FLR
DELP / KNN
DELP / DT
DELP / SVM-L
DELP / SVM-P      8.79
DELP / SVM-Q
DELP / SVM-RBF    0.00

According to Table 2, and setting aside the result obtained with SVM-RBF (where the proposed DELP model achieved the same classification performance), it is possible to notice that the proposed DELP model obtained an improvement greater than 8% over the results achieved using the MLP, MRLNN, MP/CL, SLMP, FLNN, FLR, KNN, DT, SVM-L, SVM-P and SVM-Q models.
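As a quick numerical check of Equation 32, the two gains reported in Table 2 can be recovered from the corresponding PMP entries in Table 1 (the helper name percentage_gain is ours):

```python
def percentage_gain(pmp_delp, pmp_model):
    """Percentage gain of the DELP over a reference model, Eq. (32)."""
    return (1.0 - pmp_delp / pmp_model) * 100.0

print(round(percentage_gain(8.30, 9.10), 2))   # DELP vs. SVM-P   -> 8.79
print(round(percentage_gain(8.30, 8.30), 2))   # DELP vs. SVM-RBF -> 0.0
```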

The decision surface generated by the proposed DELP model for the Ripley's synthetic problem is depicted in Figure 1.

Figure 1. Decision surface produced by the proposed DELP model for the Ripley's synthetic problem.

3.2. Wisconsin Breast Cancer Problem

The Wisconsin breast cancer problem [Asuncion and Newman 2007] consists of samples from two classes, representing malignant and benign breast cancer. The data are divided into training and test sets, where we used the same partitioning scheme suggested by [Sussner and Esmi 2011] (the first 249 samples of the benign class and the first 148 samples of the malignant class are used in the training set, and the rest of the samples from both classes are used in the test set). Each sample has a 30-dimensional feature vector.

Table 3 presents the experimental results on the test set obtained by the models presented in the literature, as well as those achieved by the proposed DELP model. According to Table 3, we can verify that the best models found in the literature are the SVM-L and SVM-Q (having the same PMP = 1.75%). However, a slightly inferior classification performance can be found using the MRLNN, SVM-RBF, FLR, MP/CL and MLP models. It is possible to notice that the proposed DELP model obtained a good classification performance (with PMP = 1.40%), overcoming the best models found in the literature.

Table 4 presents the PG (test set), in terms of the PMP obtained using the DELP model regarding the PMP obtained using the other models investigated in this work. According to Table 4, we can see that the proposed DELP model obtained an improvement greater than 20% over the results achieved using the MLP, MRLNN, MP/CL, SLMP, FLNN, FLR, KNN, DT, SVM-L, SVM-Q, SVM-P and SVM-RBF models.

Table 3. Percentage of misclassified patterns of the test set for the Wisconsin breast cancer problem.

Model     PMP (%)
MLP       4.55
MRLNN     2.10
MP/CL     4.20
SLMP
FLNN      5.59
FLR       3.50
KNN       5.94
DT        8.74
SVM-L     1.75
SVM-P
SVM-Q     1.75
SVM-RBF   3.15
DELP      1.40

Table 4. Percentage gain (test set) of the proposed DELP model regarding the MLP, MRLNN, MP/CL, SLMP, FLNN, FLR, KNN, DT, SVM-L, SVM-P, SVM-Q and SVM-RBF models for the Wisconsin breast cancer problem.

                  PG (%)
DELP / MLP
DELP / MRLNN
DELP / MP/CL
DELP / SLMP
DELP / FLNN
DELP / FLR
DELP / KNN
DELP / DT
DELP / SVM-L
DELP / SVM-P
DELP / SVM-Q
DELP / SVM-RBF

4. Conclusion

In this paper, a hybrid artificial neuron was presented for dealing with synthetic and real-world binary classification problems. The proposed model, called the dilation-erosion-linear perceptron (DELP), consists of a linear combination of nonlinear morphological operators under the context of lattice theory and a linear operator. For the DELP design (learning process) we presented a gradient steepest descent method based on ideas from the back propagation (BP) algorithm, using a systematic approach to overcome the problem of nondifferentiability of morphological operations.

The classification performance of the proposed DELP model was assessed against well-known models presented in the literature (MLP, MRLNN, MP/CL, SLMP, FLNN, FLR, KNN, DT, SVM-L, SVM-P, SVM-Q and SVM-RBF) using the percentage of misclassified patterns metric. Besides, two relevant binary classification problems were investigated in this work: Ripley's Synthetic and Wisconsin Breast Cancer. The experimental results demonstrated similar performance (for the Ripley's problem) and better performance (for the Wisconsin Breast Cancer problem) of the proposed DELP model in comparison to the best models found in the literature.

In other words, the proposed DELP model succeeded in solving the aforementioned classification problems, exhibiting very satisfactory classification results.

Further studies must be developed to better formalize and explain the properties of the proposed DELP model and to determine its possible limitations on other binary classification problems. Further studies, in terms of convergence analysis, must also be done on the learning process of the DELP model. Finally, a particular study of the computational complexity and CPU time of the proposed DELP model must be done in order to establish a complete cost-performance evaluation of the proposed model. According to this investigation, it will be possible to relate, in terms of cost, the necessary time to generate an optimal model.

References

Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

de A. Araújo, R., Madeiro, F., de Sousa, R. P., and Pessoa, L. F. C. (2006a). Modular morphological neural network training via adaptive genetic algorithm for designing translation invariant operators. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2.

de A. Araújo, R., Madeiro, F., de Sousa, R. P., Pessoa, L. F. C., and Ferreira, T. A. E. (2006b). An evolutionary morphological approach for financial time series forecasting. In Proceedings of the IEEE Congress on Evolutionary Computation.

de A. Araújo, R., Madeiro, F., Ferreira, T. A. F., de Sousa, R. P., and Pessoa, L. F. C. (2006c). Improved evolutionary hybrid method for designing morphological operators. In Proceedings of the IEEE International Conference on Image Processing.

de Sousa, R. P., Carvalho, J. M., Assis, F. M., and Pessoa, L. F. C. (2000). Designing translation invariant operations via neural network training. In Proc. of the IEEE Intl Conference on Image Processing, Vancouver, Canada.

Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.

Esposito, F., Malerba, D., and Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Trans. Pattern Anal. Mach. Intell., 19(5).

Hagan, M. and Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6).

Haykin, S. (1998). Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey.

Haykin, S. (2007). Neural Networks and Learning Machines. McMaster University, Canada.

Heijmans, H. J. A. M. (1994). Morphological Image Operators. Academic Press, New York, NY.

Hocaoglu, A. K. and Gader, P. D. (2003). Domain learning using Choquet integral-based morphological shared weight neural networks. Image and Vision Computing, 21(7).

Kaburlasos, V. G., Athanasiadis, I. N., and Mitkas, P. A. (2007). Fuzzy lattice reasoning (FLR) classifier and its application for ambient ozone estimation. Int. J. Approx. Reasoning, 45(1).

Kaburlasos, V. G. and Petridis, V. (2000). Fuzzy lattice neurocomputing (FLN) models. Neural Networks, 13(10).

Khabou, M. A. and Gader, P. D. (2000). Automatic target detection using entropy optimized shared-weight neural networks. IEEE Transactions on Neural Networks, 11(1).

Maragos, P. (1989). A representation theory for morphological image and signal processing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11.

Pessoa, L. F. C. and Maragos, P. (2000). Neural networks with hybrid morphological/rank/linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition, 33.

Petridis, V. and Kaburlasos, V. G. (1998). Fuzzy lattice neural network (FLNN): a hybrid model for learning. IEEE Transactions on Neural Networks, 9(5).

Prechelt, L. (1994). Proben1: A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, United Kingdom.

Ritter, G. X., Sussner, P., and de Leon, J. L. D. (1998). Morphological associative memories. IEEE Transactions on Neural Networks, 9(2).

Ritter, G. X., Sussner, P., and Hacker, W. B. (1997). Associative memories with infinite storage capacity. In InterSymp 97, 9th International Conference on Systems Research, Informatics and Cybernetics, Baden-Baden, Germany. Invited Plenary Paper.

Ritter, G. X. and Urcid, G. (2003). Lattice algebra approach to single-neuron computation. IEEE Transactions on Neural Networks, 14(2).

Silva, A. M. and Sussner, P. (2008). A brief review and comparison of feedforward morphological neural networks with applications to classification. In Proceedings of the International Conference on Artificial Neural Networks.

Sousa, R. P. (2000). Design of translation invariant operators via neural network training. PhD thesis, UFPB, Campina Grande, Brazil.

Sussner, P. (1998a). Kernels for morphological associative memories. In Proceedings of the International ICSA/IFAC Symposium on Neural Computation, pages 79-85, Vienna.

Sussner, P. (1998b). Morphological perceptron learning. In Proceedings of the IEEE International Symposium on Intelligent Control, Gaithersburg, MD.

Sussner, P. and Esmi, E. L. (2009a). Constructive morphological neural networks: some theoretical aspects and experimental results in classification. In Kacprzyk, J., editor, Constructive Neural Networks, Studies in Computational Intelligence. Springer Verlag, Heidelberg, Germany.

Sussner, P. and Esmi, E. L. (2009b). An introduction to morphological perceptrons with competitive learning. In Proceedings of the International Joint Conference on Neural Networks, Atlanta, GA.

Sussner, P. and Esmi, E. L. (2011). Morphological perceptrons with competitive learning: Lattice-theoretical framework and constructive learning algorithm. Information Sciences, 181(10).

Sussner, P. and Valle, M. E. (2006a). Grayscale morphological associative memories. IEEE Transactions on Neural Networks, 17(3).

Sussner, P. and Valle, M. E. (2006b). Implicative fuzzy associative memories. IEEE Transactions on Fuzzy Systems, 14(6).

Sussner, P. and Valle, M. E. (2007). Morphological and certain fuzzy morphological associative memories for classification and prediction. In Kaburlassos, V. G. and Ritter, G. X., editors, Computational Intelligence Based on Lattice Theory, volume 67. Springer Verlag, Heidelberg, Germany.

Valle, M. E. and Sussner, P. (2008). A general framework for fuzzy morphological associative memories. Fuzzy Sets and Systems, 159(7).
