A Simple Implementation of the Stochastic Discrimination for Pattern Recognition

Dechang Chen (1) and Xiuzhen Cheng (2)
(1) University of Wisconsin Green Bay, Green Bay, WI 54311, USA, chend@uwgb.edu
(2) University of Minnesota, Minneapolis, MN 55455, USA, cheng@cs.umn.edu

Abstract. The method of stochastic discrimination (SD) introduced by Kleinberg ([6,7]) is a new method in pattern recognition. It works by producing weak classifiers and then combining them, via the Central Limit Theorem, to form a strong classifier. SD is overtraining-resistant, has a high convergence rate, and can work quite well in practice. However, the strict assumptions involved in SD and the difficulty of understanding it have limited its practical use. In this paper, we present a simple SD algorithm for two-class pattern recognition. We illustrate the algorithm by classifying feature vectors from several real and simulated data sets. The experimental results show that SD is fast, effective, and applicable.

1 Introduction

Suppose that certain objects are to be classified as coming from one of two classes, say class 1 and class 2. A fixed number of measurements made on each object form a feature vector q. All the feature vectors constitute a finite feature space F ⊂ R^p. We can classify an object after observing its feature vector q with the aid of the classification rule of SD and a training set TR = {TR_1, TR_2}, where TR_i is a given random sample from class i.

On an intuitive level, the idea of SD is similar to how people learn: people acquire new knowledge and strategies step by step, and after years of learning, the accumulated knowledge and strategies (or skills) enable them to tackle complicated tasks. On a precise mathematical level, the procedure of SD is outlined as follows.

Step 1. Use the training set and rectangular regions to produce t weak classifiers S^{(1)}, S^{(2)}, ..., S^{(t)}, where t is a natural number. See Section 2.

Step 2. For any given feature vector q from F, calculate the average

    Y(q, S^t) = ( X(q, S^{(1)}) + X(q, S^{(2)}) + ... + X(q, S^{(t)}) ) / t,    (1)

where X(·, ·) is a base random variable defined later in Section 3.

Step 3. Set a level t classification rule as follows: if Y(q, S^t) ≥ 1/2, classify q into class 1; otherwise classify q into class 2. See Section 4.
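As a rough illustration of Steps 2 and 3, the following minimal Python sketch averages the base random variable over the t weak classifiers and thresholds the result at 1/2. The names (classify_level_t, base_variable, weak_classifiers) are ours, not the authors' implementation, and base_variable stands in for the quantity X(·, ·) defined in Section 3.

```python
def classify_level_t(q, weak_classifiers, base_variable):
    """Level t rule: average X(q, S) over the t weak classifiers (equation (1))
    and assign class 1 when the average Y(q, S^t) is at least 1/2 (Step 3)."""
    t = len(weak_classifiers)
    Y = sum(base_variable(q, S) for S in weak_classifiers) / t
    return 1 if Y >= 0.5 else 2
```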

The above procedure follows the idea in [7]. We will study these steps in Sections 2-4. SD is characterized by the properties of overtraining-resistance, a high convergence rate, and a low misclassification error rate (see [2] and [7]); this will be shown by examples in Section 5. The underlying ideas behind SD were introduced in [6]. Since then, a fair amount of research has been carried out on this method and on variations of its implementation; see, for example, [1], [2], [3], [4], [7], and [8]. The results have convincingly shown that stochastic discrimination is a promising area in pattern recognition.

2 How to Produce Weak Classifiers

We produce weak classifiers through resampling. In fact, a weak classifier is a finite union of rectangular regions which satisfies a certain coverage condition. In this context, a rectangular region in R^p is a region of the form

    { (x_1, x_2, ..., x_p) : a_i ≤ x_i ≤ b_i, for i = 1, 2, ..., p },    (2)

where a_i and b_i are real numbers for i = 1, 2, ..., p. Let R be an appropriate rectangular region in R^p which contains F. We will utilize those rectangular regions in (2) whose width b_i - a_i along the x_i-axis is at least ρ times the corresponding width of R, where 0 < ρ < 1 is a fixed constant.

The coverage condition is related to a ratio variable r. For any subsets T_1 and T_2 of F, let r(T_1, T_2) denote the ratio of the number of feature vectors common to T_1 and T_2 to the number of feature vectors in T_2. For example, if T_2 contains 5 feature vectors and T_1 and T_2 have 3 feature vectors in common, then r(T_1, T_2) = 3/5 = 0.6. Thus r(T_1, T_2) represents the coverage of the points in T_2 by T_1.

Now we can define a weak classifier. Roughly speaking, a weak classifier is a finite union of rectangular regions such that the coverage of the points in TR_1 by the union and the coverage of the points in TR_2 by the union are different. Strictly speaking, let β be a fixed real number with 0 < β < 1; then S is said to be a weak classifier if S is a union of at most κ rectangular regions in R^p which satisfies

    |r(S, TR_1) - r(S, TR_2)| ≥ β.

The condition r(S, TR_1) ≠ r(S, TR_2) simply states that S can actually be used as a (very weak) classifier. To illustrate this, let n_1 and n_2 denote the numbers of points in TR_1 and TR_2, and consider an S which contains 40 sample points from TR_1 and 60 from TR_2. Then r(S, TR_1) = 40/n_1 and r(S, TR_2) = 60/n_2. Suppose 40/n_1 > 60/n_2; that is, the coverage of the points in TR_1 by S is greater than the coverage of the points in TR_2 by S. Let S^c denote the complement of S in F. Then, since 1 - 40/n_1 < 1 - 60/n_2, the coverage of the points in TR_2 by S^c is greater than the coverage of the points in TR_1 by S^c. Thus, intuitively, we could use S to classify F by deciding that any sample point q from S belongs to class 1 and any sample point q from S^c belongs to class 2. This of course gives a (very) weak classifier.
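To make the construction concrete, here is a hedged sketch of how weak classifiers could be produced by resampling rectangular regions as described above. The helper names (make_rectangle, coverage, sample_weak_classifier), the rejection-sampling loop, and the max_tries cap are our own illustrative choices, not the authors' code; rectangles are drawn uniformly inside R subject only to the minimum-width constraint ρ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rectangle(R_low, R_high, rho):
    """Draw one axis-aligned rectangle inside R whose width along each axis
    is at least rho times the corresponding width of R."""
    width = R_high - R_low
    w = (rho + (1.0 - rho) * rng.random(len(width))) * width  # widths in [rho*width, width]
    low = R_low + rng.random(len(width)) * (width - w)        # place it inside R
    return low, low + w

def coverage(S, T):
    """r(S, T): fraction of the points in T that fall inside the union S of rectangles."""
    inside = np.zeros(len(T), dtype=bool)
    for low, high in S:
        inside |= np.all((T >= low) & (T <= high), axis=1)
    return inside.mean()

def sample_weak_classifier(TR1, TR2, R_low, R_high, rho, beta, kappa, max_tries=10000):
    """Keep drawing unions of at most kappa rectangles until the coverage
    condition |r(S, TR_1) - r(S, TR_2)| >= beta is satisfied."""
    for _ in range(max_tries):
        k = rng.integers(1, kappa + 1)
        S = [make_rectangle(R_low, R_high, rho) for _ in range(k)]
        if abs(coverage(S, TR1) - coverage(S, TR2)) >= beta:
            return S
    raise RuntimeError("no weak classifier found; consider relaxing beta or rho")
```

In this representation a weak classifier S is simply a list of (low, high) corner pairs; its complement S^c never needs to be stored, since the rule in Sect. 4 only uses membership in S.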

3 Base Random Variables

To connect weak classifiers with feature vectors, we need a base random variable X(·, ·). Given a feature vector q and a weak classifier S, the value of X is defined to be

    X(q, S) = ( 1_S(q) - r(S, TR_2) ) / ( r(S, TR_1) - r(S, TR_2) ),    (3)

where 1_S(q) = 1 if q is contained in S and 0 otherwise. X(·, ·) in (3) can be understood as a standardized version of 1_S(q), which gives the simplest way to connect the weak classifiers S with the feature vectors q.

4 Classification Rule

Let S^t = (S^{(1)}, S^{(2)}, ..., S^{(t)}) be a random sample of t weak classifiers. For any q ∈ F, define Y(q, S^t) as stated in (1). Under some mild conditions, the Central Limit Theorem can be used to show the following fact: if t is large enough, then with high probability Y(q, S^t) is close to 1 for any q from TR_1 and close to 0 for any q from TR_2 (see Theorem 1 in [2]). Hence one can define the following level t stochastic discriminant classification rule: for any q ∈ F, if Y(q, S^t) ≥ 1/2, classify q into class 1; otherwise classify q into class 2.

5 Experimental Analysis

In this section, we report experimental results on classifying feature vectors from several problems. The emphasis is placed on normal populations. A comparison of SD with non-SD methods is also given.

Example 1 (Two normal populations with equal covariance matrices). Consider the two distributions N(μ_1, I) (class 1) and N(μ_2, I) (class 2), where I is the 2 x 2 identity matrix, μ_1 = (1.5, 0), and μ_2 = (0, 0). The prior class probabilities π_1 (for class 1) and π_2 (for class 2) are both equal to 1/2. The training set contains 400 points from each class, and the test set contains 1000 points from each class. Let R_1 be the smallest rectangular region which contains both the training and the test data. For λ ≥ 1, let R_λ denote the rectangular region similar to R_1 whose center is the same as that of R_1 and whose width along each x_i-axis is λ times the corresponding width of R_1. We regard R_λ as our R defined in Sect. 2. For the resampling process, λ = 1, ρ = 0.3, β = 0.52, and κ = 5. The test error from SD stays below 23.55% once the level t ≥ 4700. See Fig. 1 for the performance of SD. From the figure, we see that both the training and test errors decrease at the beginning and then quickly level off, forming two parallel curves, as more weak classifiers are added. This phenomenon is common for SD classification procedures. As a comparison, the linear discriminant rule yields a test error of 23.15%.
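Continuing the sketch, the base random variable (3) and an Example 1-style data layout might look as follows in Python. Here coverage() is the illustrative helper from the previous sketch, and the Gaussian training and test samples are our reconstruction of the stated setup, not the authors' data or code.

```python
import numpy as np

def base_variable(q, S, TR1, TR2):
    """X(q, S) = (1_S(q) - r(S, TR_2)) / (r(S, TR_1) - r(S, TR_2)), cf. equation (3).
    The coverage condition in Sect. 2 guarantees |r1 - r2| >= beta > 0,
    so the denominator is never zero."""
    in_S = float(any(np.all((q >= low) & (q <= high)) for low, high in S))
    r1, r2 = coverage(S, TR1), coverage(S, TR2)
    return (in_S - r2) / (r1 - r2)

# Example 1 layout: N((1.5, 0), I) vs. N((0, 0), I),
# 400 training and 1000 test points per class.
rng = np.random.default_rng(1)
TR1 = rng.normal(loc=(1.5, 0.0), scale=1.0, size=(400, 2))
TR2 = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(400, 2))
test1 = rng.normal(loc=(1.5, 0.0), scale=1.0, size=(1000, 2))
test2 = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(1000, 2))
```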

[Figure 1 here: training and test error versus level t] Fig. 1. Training and test errors for the classification with two normal distributions having the same covariance matrix. The training set contains 400 points from each class, and the test set contains 1000 points from each class.

Example 2 (Classifying Alaskan and Canadian salmon). Here we show the performance of SD on the salmon dataset from Johnson and Wichern ([5]). The original data contain, for 50 Alaskan and 50 Canadian salmon, the gender, the diameter of rings for the first-year freshwater growth, and the diameter of rings for the first-year marine growth. We treat the freshwater and marine growth ring diameters as the features of a salmon. Johnson and Wichern ([5]) note that the data appear to satisfy the assumption of bivariate normal distributions, but the covariance matrices may differ; thus the usual quadratic classification rule may be used to classify the salmon. Using the quadratic rule and equal prior probabilities, the error rate from a 10-fold cross-validation is 8%. To apply SD, we set (λ, ρ, β, κ) = (1.005, 0.1, 0.5, 5). On the same data splits as those used with the quadratic rule, the 10-fold cross-validation error rate from SD is essentially 9%. See Fig. 2 for the details.

Other comparisons of SD with non-SD methods are also available. For example, [2] considers the classification of the Pima Indians diabetes dataset described in [9]; the test error from SD is identical to the best result reported in [9]. Section 3 of [7] reports one experiment on handwritten digit recognition and another on classifying boundary types in DNA sequences. In the first experiment, SD is compared with a nearest neighbor algorithm, a k-nearest neighbor algorithm, and a neural network; in the second, SD is compared with more than 20 different methods. In both cases the comparisons show that SD yields the best test set performance.
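For reference, the 10-fold cross-validation errors quoted in Example 2 could be estimated with a generic routine like the sketch below. Here fit_and_classify is a placeholder for either the quadratic rule or the SD procedure, and the random fold assignment is our assumption; the paper does not describe its splitting code.

```python
import numpy as np

def cv_error(X, y, fit_and_classify, k=10, seed=0):
    """Average misclassification rate over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_and_classify(X[train], y[train], X[test])
        errs.append(np.mean(pred != y[test]))
    return float(np.mean(errs))
```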

[Figure 2 here: training and test error versus level t] Fig. 2. The averaged training and test errors for classifying Alaskan and Canadian salmon. The errors are based on a 10-fold cross-validation procedure.

Notes. When applying SD to a dataset, we need values for λ, ρ, κ, and β. From Sect. 2, we know that these four parameters together determine the weak classifiers. Since no quantitative relationship among these parameters is available, we can apply SD to the training set alone to find the combination of parameters for which the misclassification rate on TR is minimal. Thus, we propose the following two-step procedure. First, we run SD over TR, stepping through the range of these parameters, and find the combinations corresponding to the best achieved TR performance; since SD has an exponential convergence rate ([2]), this step is practical. In fact, we usually obtain several satisfactory combinations, and we choose the one with which SD runs fastest. Second, we apply SD to both the training and test sets with the selected parameters. A sketch of such a parameter search appears below.

Acknowledgments. The first author is very grateful to his advisor, Professor Eugene Kleinberg, for providing an introduction to the field of pattern recognition and for sharing his own research with the author.
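The two-step parameter search described in the Notes might be sketched as follows. The grids and the training_error helper are illustrative assumptions, not the values or code used in the paper; training_error is assumed to run SD on TR with the given parameters and return the training misclassification rate.

```python
import itertools

def select_parameters(TR1, TR2, training_error,
                      lambdas=(1.0, 1.005, 1.01),
                      rhos=(0.1, 0.2, 0.3),
                      betas=(0.5, 0.52, 0.55),
                      kappas=(3, 5, 7)):
    """Step 1: step through the parameter grid and keep the combination that
    minimizes the misclassification rate on the training set alone.
    Step 2 (not shown): rerun SD on training and test sets with these values."""
    best_err, best_params = None, None
    for lam, rho, beta, kappa in itertools.product(lambdas, rhos, betas, kappas):
        err = training_error(TR1, TR2, lam, rho, beta, kappa)
        if best_err is None or err < best_err:
            best_err, best_params = err, (lam, rho, beta, kappa)
    return best_params
```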

References

1. Berlind, R.: An Alternative Method of Stochastic Discrimination with Applications to Pattern Recognition. Doctoral Dissertation, Dept. of Mathematics, State University of New York at Buffalo (1994)
2. Chen, D., Huang, P., Cheng, X.: A Concrete Statistical Realization of Kleinberg's Stochastic Discrimination for Pattern Recognition, Part I: Two-Class Classification. Submitted for publication
3. Ho, T. K.: Random Decision Forests. In: Kavanaugh, M., Storms, P. (eds.): Proceedings of the Third International Conference on Document Analysis and Recognition. IEEE Computer Society Press, New York (1995) 278-282
4. Ho, T. K.: The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 832-844
5. Johnson, R. A., Wichern, D. W.: Applied Multivariate Statistical Analysis. 4th edn. Prentice Hall (1998)
6. Kleinberg, E. M.: Stochastic Discrimination. Annals of Mathematics and Artificial Intelligence 1 (1990) 207-239
7. Kleinberg, E. M.: An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition. Annals of Statistics 24 (1996) 2319-2349
8. Kleinberg, E. M., Ho, T. K.: Building Projectable Classifiers of Arbitrary Complexity. In: Kavanaugh, M. E., Werner, B. (eds.): Proceedings of the 13th International Conference on Pattern Recognition. IEEE Computer Society Press, New York (1996) 880-885
9. Ripley, B. D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)