USING GENETIC ALGORITHMS TO IMPROVE THE PERFORMANCE OF CLASSIFICATION RULES PRODUCED BY SYMBOLIC INDUCTIVE METHODS
Sixth International Symposium on Methodologies for Intelligent Systems, Charlotte, NC, October 16-19, 1991

USING GENETIC ALGORITHMS TO IMPROVE THE PERFORMANCE OF CLASSIFICATION RULES PRODUCED BY SYMBOLIC INDUCTIVE METHODS

Jerzy Bala, Kenneth De Jong, Peter Pachowicz
Center for Artificial Intelligence, George Mason University
4400 University Drive, Fairfax, VA

Abstract: In this paper we present a novel way of combining symbolic inductive methods and genetic algorithms (GAs) to produce high-performance classification rules. The presented method consists of two phases. In the first, the algorithm inductively learns a set of classification rules from noisy input examples. In the second phase, the worst performing rule is optimized by GA techniques. Experimental results are presented for twelve classes of noisy data obtained from textured images.

1 Introduction

One fundamental weakness of inductive learning (except for special cases) is the fact that the acquired knowledge cannot be validated. Traditional inquiries into inductive inference have therefore dealt with questions of what are the best criteria for guiding the selection of inductive assertions, and how these assertions can be confirmed. The goal of inference is to formulate plausible general assertions that explain the given facts and are able to predict new facts. For a given set of facts, a potentially infinite number of hypotheses can be generated that imply these facts. A preference criterion is therefore necessary to provide constraints and reduce the infinite choice to one hypothesis or a few more preferable ones. A typical way of defining such a criterion is to specify the preferable properties of the hypothesis, for example, to require that the hypothesis be the shortest or the most economical description consistent with all the facts. Even when such a criterion is defined, other problems remain, in particular performance on future data.
This is especially important if the initial data has some distribution of noisy and irrelevant examples. We propose a novel hybrid approach in which, first, a classic rule induction system (AQ) produces short, simple, complete and consistent rules; and then GAs are used to improve the performance of the rules via an evolutionary mechanism. By dropping the requirement for a rule
to be complete and consistent with respect to the initial learning data, we reduce the coverage of noisy and irrelevant components in the rule structure. Since the AQ learning algorithm is the induction method used in our approach, the next section presents a short overview of it. Section 3 presents the GA phase of our method, and the experimental results are presented in section 4.

2 AQ learning algorithm

The AQ algorithm [Michalski et al., 1986] learns attributional descriptions from examples. When building a decision rule, AQ performs a heuristic search through a space of logical expressions to determine those that account for all positive examples and no negative examples. Because there are usually many such complete and consistent expressions, the goal of AQ is to find the most preferred one, according to some criterion. Learning examples are given in the form of events, which are vectors of attribute values. Events represent different decision classes. For each class a decision rule is produced that covers all examples of that class and none of the other classes. The concept descriptions learned by the AQ algorithm are represented in VL1, the Variable-Valued Logic System 1 [Michalski, 1972]. A description of a concept is a disjunctive normal form, called a cover. A cover is a disjunction of complexes. A complex is a conjunction of selectors. A selector has the form:

[L # R]   (1)

where L, called the referee, is an attribute; R, called the referent, is a set of values in the domain of the attribute L; and # is one of the following relational symbols: =, <, >, >=, <=, <>. The following is an example of a five-selector AQ complex (equality is used as the relational symbol):

[x1=1..3] [x2=1] [x4=0] [x6=1..7] [x8=1]

Dots in the description represent a range of possible values of a given attribute. An example of a simple, four-complex
AQ description is presented below:

1  [x1=7..8] [x2=8..19] [x3=8..13] [x5=4..54] [x6=0..3]
2  [x1=15..54] [x3=11..14] [x4=...] [x6=0..9] [x7=0..11]
3  [x1=9..18] [x3=16..21] [x4=9..10]
4  [x1=10..14] [x3=...] [x4=14..54] [x7=4..5]
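The cover/complex/selector structure described above can be sketched in code. This is an illustrative Python sketch, not AQ's own representation: a complex is modeled as a mapping from an attribute index to its (low, high) value range, and the function names are assumptions for this example.

```python
# A sketch of the VL1 structures described above (illustrative Python, not AQ's
# own code). A selector restricts one attribute to a value range, a complex is
# a conjunction of selectors, and a cover is a disjunction of complexes.

def matches_complex(event, cplx):
    """An event (dict: attribute index -> value) satisfies a complex iff
    every selector [x_attr = lo..hi] holds."""
    return all(lo <= event[attr] <= hi for attr, (lo, hi) in cplx.items())

def matches_cover(event, cover):
    """A cover (rule) matches iff at least one of its complexes matches."""
    return any(matches_complex(event, c) for c in cover)

# The five-selector complex from the text: [x1=1..3][x2=1][x4=0][x6=1..7][x8=1]
cplx = {1: (1, 3), 2: (1, 1), 4: (0, 0), 6: (1, 7), 8: (1, 1)}
event = {1: 2, 2: 1, 3: 9, 4: 0, 5: 7, 6: 5, 7: 0, 8: 1}
print(matches_cover(event, [cplx]))   # → True
```

Under this crisp reading, attributes a complex does not mention (x3, x5, x7 above) are unconstrained; section 3.3 of the paper replaces the all-or-nothing match with a graded one.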
3 Genetic Algorithm Phase

The rules generated by the AQ algorithm have the property of being complete and consistent with respect to the learning examples. These properties mean that each class description covers all of its learning examples (positive examples) and does not cover any negative examples (negative examples being all examples belonging to other classes). When the AQ algorithm is applied to finer-grained and noisy symbolic representations, the consistency and completeness conditions overconstrain the generated class description. This, in turn, leads to poorer predictive accuracy on future data. Conversely, by relaxing the requirements of consistency and completeness, predictive accuracy in noisy domains can be improved. In the proposed method we use GAs to improve the performance of initially consistent and complete rules via an evolutionary mechanism. A genetic algorithm [De Jong, 1988] maintains a constant-sized population of candidate solutions, known as individuals. At each iteration, each individual is evaluated and recombined with others on the basis of its overall quality or fitness. The expected number of times an individual is selected for recombination is proportional to its fitness relative to the rest of the population. The power of a genetic algorithm lies in its ability to exploit, in a highly efficient manner, information about a large number of individuals. By allocating more reproductive opportunities to above-average individuals, the overall effect is an increase in the population's average fitness. New individuals are created using two main genetic recombination operators, known as crossover and mutation. Crossover operates by selecting a random location in the genetic string of the parents (the crossover point) and concatenating the initial segment of one parent with the final segment of the other to create a new child. A second child is simultaneously generated using the remaining segments of the two parents.
Mutation provides occasional disturbances to insure diversity in the genetic strings over long periods of time and to prevent stagnation in the convergence of the optimization technique. In order to apply GAs, one needs to choose a representation, define the operators, and define the performance measure. In traditional GAs, individuals in the population are typically represented using a binary string notation to promote efficiency and application independence in the genetic operations. The mathematical analysis of GAs shows that they work best when the internal representation encourages the emergence of useful building blocks that can subsequently be combined with others to produce improved performance; string representations are just one of many ways of achieving this. In our application of GAs we do not use string representations. Each individual of the population is represented in VL1 (see the previous section) as a different version of the given rule that is the subject of the optimization. The VL1 representation is very "natural" for the optimization problem defined in this paper (the search for a better performing rule) and can easily be manipulated by GA operators. The next section describes the GA operators chosen for our method.
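The generational cycle described above (fitness-proportional selection, one-point crossover, occasional mutation) can be sketched on a traditional bit-string representation. The bit strings and the ones-counting fitness below are stand-ins chosen only to make the loop concrete; the paper's GA instead operates on VL1 rules, as section 3.1 describes.

```python
import random

# Minimal generational GA, to illustrate the cycle described above (toy
# objective and parameters are assumptions, not the paper's setup).
def evolve(pop_size=20, length=16, generations=30, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)                      # toy objective: count of 1s
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Expected number of recombinations per individual ~ relative fitness.
        weights = [fitness(ind) + 1e-9 for ind in pop]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, length)              # crossover point
            child = p1[:cut] + p2[cut:]                 # initial + final segments
            child = [b ^ 1 if rng.random() < p_mut else b
                     for b in child]                    # occasional mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve()
print(sum(best))  # the best individual approaches the all-1s optimum
```

Allocating more offspring to above-average individuals is what raises the population's average fitness over generations; mutation keeps the strings diverse so the search does not stagnate.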
3.1 GA operators

An initial population of rules is created by making small variations to an existing rule. Then, a population of different variations of this rule is evolved using GA operators. Each rule in the population is evaluated based on its performance within the set of initial rules. To evaluate a rule we use the tuning data, which are part of the set of learning examples. The performance of the best rule in the population is monitored by matching this rule against the testing examples.

Mutation is performed by introducing small changes to the condition parts (selectors) of a selected disjunct (rule complex) of a given rule. The selectors to be changed are chosen by randomly generating two pointers: the first points to the rule disjunct (complex), and the second points to a selector of this disjunct. Random changes are introduced to the left-most or right-most (this is also chosen randomly) value of this selector; for example, a selector such as [x5=...] is changed into a variant in which one boundary of its value range has been shifted slightly. Such a mutation process samples the space of possible boundaries between rules to minimize the coverage of noisy and irrelevant components of a given rule.

The crossover operation is performed by splitting rule disjuncts into two parts, upper disjuncts and lower disjuncts. These parts are exchanged between parent rules to produce new child rules. Since the degree of match of a given instance depends on the degree of match of this instance to each disjunct of the rule, this exchange process enables inheritance of information about strong (strongly matching) disjuncts in the individuals of the next evolved population. An example of crossover applied to short, four-disjunct rules is depicted below:

Parent rule 1
1  [x1=7..8] [x2=8..19] [x3=8..13] [x5=4..54] [x6=0..3]
2  [x1=15..54] [x3=11..14] [x4=...] [x6=0..9] [x7=0..11]
----- crossover position -----
3  [x1=9..18] [x3=16..21] [x4=9..10]
4  [x1=10..14] [x3=...] [x4=14..54] [x7=4..5]

Parent rule 2
1  [x3=...] [x4=...] [x5=0..6] [x7=5..12]
2  [x1=8..25] [x3=8..13] [x4=9..11] [x5=0..3]
----- crossover position -----
3  [x4=0..22] [x5=8..9] [x6=0..7] [x7=11..48]
4  [x2=5..8] [x3=7..8] [x4=8..11] [x5=0..3]

Result of the crossover operation (one of the two child rules)
1  [x1=7..8] [x2=8..19] [x3=8..13] [x5=4..54] [x6=0..3]
2  [x1=15..54] [x3=11..14] [x4=...] [x6=0..9] [x7=0..11]
3  [x4=0..22] [x5=8..9] [x6=0..7] [x7=11..48]
4  [x2=5..8] [x3=7..8] [x4=8..11] [x5=0..3]
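The two rule-level operators just described can be sketched as follows. This is an illustrative sketch, not the authors' implementation: a rule is modeled as a list of disjuncts, each disjunct a dict from attribute index to a (low, high) range, and the bounds and step size are assumed values.

```python
import random

# Sketch of the rule-level GA operators from section 3.1 (illustrative code).
def mutate(rule, rng, step=1, lo_bound=0, hi_bound=54):
    """Randomly pick a disjunct, then one of its selectors, and nudge the
    left-most or right-most value of that selector's range."""
    rule = [dict(d) for d in rule]                    # work on a copy
    d = rng.randrange(len(rule))                      # pointer 1: the disjunct
    attr = rng.choice(sorted(rule[d]))                # pointer 2: the selector
    low, high = rule[d][attr]
    delta = rng.choice([-step, step])
    if rng.random() < 0.5:
        low = min(max(lo_bound, low + delta), high)   # shift left boundary
    else:
        high = max(min(hi_bound, high + delta), low)  # shift right boundary
    rule[d][attr] = (low, high)
    return rule

def crossover(rule_a, rule_b, point):
    """Exchange upper and lower disjuncts at the crossover position."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

# Abbreviated four-disjunct parents in the spirit of the example above:
rule1 = [{1: (7, 8), 2: (8, 19)}, {1: (15, 54), 3: (11, 14)},
         {1: (9, 18), 3: (16, 21)}, {1: (10, 14), 7: (4, 5)}]
rule2 = [{3: (8, 13)}, {1: (8, 25)}, {4: (0, 22)}, {2: (5, 8)}]
child1, child2 = crossover(rule1, rule2, 2)  # split after the second disjunct
```

Note that mutation only moves range boundaries, which is exactly how the method samples alternative decision boundaries without inventing new selectors, and crossover swaps whole disjuncts, so strongly matching disjuncts pass intact into the next population.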
The GA operators as described above are very effective in situations where no a priori information is given on the distribution of attribute values. Non-normal distribution of attribute values complicates the AQ-generated descriptions, especially if the learned rules consistently and completely cover input instances of the data. Graph 1 presents the most representative samples of non-normal attribute distribution obtained from the experiment with textured image classes described in section 4.

[Graph 1: four panels of frequency vs. attribute value, for Class C9 attribute x5, Class C8 attribute x1, Class C8 attribute x3, and Class C11 attribute x7.]

Graph 1. Examples of non-normal attribute distribution. The solid line corresponds to the smoothed distribution of an attribute and the dotted line corresponds to the approximated normal distribution.

The left-hand diagrams of Graph 1 present a relatively simple attribute distribution. However, the right-hand diagrams of Graph 1 present two cases of more complex attribute distributions. The upper right diagram indicates the possible formation of more than one cluster of training data. The lower right diagram presents a very complex distribution of an attribute without distinct regular clusters. One can notice that the x7
attribute does not carry significant information about class C11. Considering the above complexity of the attributes, we postulate that this domain (data in a representation space expressed by such attributes) is a good candidate for GA-type searches, where random changes (mutation) and some directional sampling (crossover) can yield high-performance solutions (optimized rules).

3.2 Rule performance evaluation

Each rule candidate in the population is evaluated by plugging it into the set of other rules and calculating the confusion matrix for the tuning examples. The confusion matrix represents information on correct and incorrect classification results. This matrix is obtained by calculating a degree of flexible match between a tuning instance and a given class description. The row entries in the matrix represent the percentage of matched instances from a given class (row index) against all class descriptions (column index). An example of the confusion matrix for 12 classes gives:

[12 x 12 confusion matrix over classes C1..C12; individual entries not legible in this copy.]

Average recognition = 74.67%
Average mis-classification = 4.83%

The above confusion matrix represents classification results on data extracted from textured images. The method of extraction and texture classification by inductive learning is not the subject of the research presented in this paper. The reader can find various information on the method of learning texture descriptions in [Bala and Pachowicz, 1991]. Let us suppose that class C11 is the one chosen to be evolved by the GA mechanism. First we mutate this class representation multiple times to produce an initial population of different individuals of that class. To test the performance of a given individual we calculate the confusion matrix. The confusion matrix shows us how this specific variation of class C11
performs when tested against the other classes using the set of tuning examples. For example, the row of the above confusion matrix for class C11 gives its matching percentages against the descriptions of all 12 classes C1..C12; these values represent the percentage of instances of class C11 (the set of tuning instances) matched to the learned descriptions of all 12 classes. If class C11 is to be optimized by GAs, we have to evaluate the performance of each individual from the population of that class by calculating the performance measure of different candidates of that class with respect to the other classes. Thus, as the performance evaluation measure of each individual of the population we use the ratio of correct recognition rates to incorrect recognition rates, CC/MC, where CC is the average recognition for correctly recognized classes (an average of the entries on the diagonal of the confusion matrix), and MC is the average mis-classification (an average of the entries outside the diagonal of the confusion matrix). For the above confusion matrix, CC/MC = 74.67/4.83 = 15.46. The next section describes how the distance measure between an example (tuning or testing) and a rule is calculated.

3.3 Recognizing class membership through flexible matching

Flexible matching measures the degree of closeness between an instance and the conditional part of a rule. Such closeness is computed as a distance value between an instance and a rule complex within the attribute space. This distance measure takes any value in the range from 0 (i.e., does not match) to 1 (i.e., matches). Calculation of this distance measure for a single test instance is executed according to the following schema. For a given condition of a rule, [x_n = val_j], and an instance where x_n = val_k, the normalized distance measure is

1 - ( |val_j - val_k| / #levels )   (2)

where #levels is the total number of attribute values. The confidence level of a rule is computed by multiplying the evaluation values of each condition of the rule.
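The matching schema above can be sketched directly from equation (2). This is an illustrative sketch under stated assumptions, not the authors' implementation: each condition is taken to use the '=' relation with a single reference value, #levels = 55 as in the worked example in the text, and the function names are invented for this example.

```python
# A sketch of flexible matching per equation (2) above (illustrative code).
def condition_eval(rule_val, inst_val, levels=55):
    """Equation (2): normalized closeness of an instance value to a condition."""
    return 1.0 - abs(rule_val - inst_val) / levels

def rule_confidence(rule, instance, levels=55):
    """Confidence of a rule complex: product of its condition evaluations.
    Attributes the complex does not constrain contribute a factor of 1."""
    c = 1.0
    for attr, val in rule.items():
        c *= condition_eval(val, instance[attr], levels)
    return c

def classify(rules_by_class, instance):
    """Class membership goes to the class of the best-matching rule."""
    return max(rules_by_class,
               key=lambda cls: rule_confidence(rules_by_class[cls], instance))

# Part of the worked example in the text: conditions [x1=0][x7=10] against the
# instance x = <4, 5, 24, 34, 0, 12, 6, 25> (the x8 condition is omitted here
# because its reference value is not legible in this copy).
x = {1: 4, 2: 5, 3: 24, 4: 34, 5: 0, 6: 12, 7: 6, 8: 25}
print(round(rule_confidence({1: 0, 7: 10}, x), 3))   # → 0.86, i.e. (1 - 4/55)^2
```

Because the confidences multiply, a single far-off condition drags the whole complex down, which is why crossover's exchange of whole disjuncts preserves strongly matching ones.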
The total evaluation of class membership of a given test instance is equal to the confidence level of the best matching rule, i.e., the rule with the highest confidence value. For example, the confidence value c for matching the following complex of a rule

[x1=0] [x7=10] [x8=...]

with a given instance x = <4, 5, 24, 34, 0, 12, 6, 25> and the number of attribute values #levels = 55 is computed as follows:

c_x1 = 1 - ( |0 - 4| / 55 ) = .928
c_x7 = 1 - ( |10 - 6| / 55 ) = .928
c_x8 = 1 - ( ... / 55 ) = .91
c_x2, c_x3, c_x4, c_x5, c_x6 = 1
c = c_x1 * c_x2 * c_x3 * c_x4 * c_x5 * c_x6 * c_x7 * c_x8 = 0.78

The recognition process yields the class membership for which the confidence level is the highest among the matched rules. The calculated confidence level, however, is not a probability measure, and it can yield more than one class membership. This means that for a given preclassified test dataset, the system recognition effectiveness is calculated as the ratio of correctly classified instances to the total number of instances in the dataset. The above described method of rule evaluation by flexible matching and confidence level calculation has important consequences for the crossover operation. Since the degree of match of an instance to a rule depends on the confidence level of this instance with respect to each disjunct of the rule, the swapping process introduced by the crossover operator enables inheritance of the information represented in the strong disjuncts (those with high confidence levels) of rule candidates through generations of a population.

4. Experimental results

In the experiment we used 12 classes of texture data. The initial descriptions were generated by the AQ module using 200 examples per class. Another set of 100 examples was used to calculate the performance measure for the GA cycle. The testing set had 200 examples extracted from image areas different from the training and tuning data. Examples of a given class were constructed using 8 attributes and an extraction process based on Law's masks [Bala and Pachowicz, 1991]. The "weakest" class (the one with the lowest matching results on testing data) chosen for the experiment was class C11 (as in the confusion matrix example in section 3.2), with 40 disjuncts. Graph 2 presents the results of the experiment. White circles in the diagrams represent characteristics obtained for the tuning data used to guide the genetic search. Characteristics mapped by black circles were obtained for the testing data.
The upper diagram monitors the performance of the genetically evolved description of the C11 texture. When all 300 examples were used to generate rules, the average classification rate for this class (when tested with 200 examples) was below 45%. When the set of 300 examples was split into two parts, 200 for the initial inductive learning and 100 as the tuning data for the GA cycle, the correct classification rate obtained in the 30th evolution step was above 60%. That is a significant increase in comparison with the 45% obtained from inductive learning only. The bottom diagrams represent the evaluation function used by the genetic algorithm (CC/MC, section 3.2) to guide the genetic search. The evaluation function was calculated as the ratio of correct classifications to mis-classifications over all twelve texture classes. These diagrams are depicted for both testing and tuning data. The increases of CC/MC on both diagrams represent an overall improvement of system recognition performance. The system performance was also investigated for a larger number of GA generation steps. However, it appears that the substantial increase was reached, both for the C11 class description and for the overall system performance, within very few generation steps (i.e., within 10 steps).
[Graph 2: classification performance over 30 generations; white circles mark results for tuning data, black circles results for testing data.]

Graph 2. Results of the experiment with 12 classes.

5. Conclusion

This paper presents a novel approach to the optimization of classification rule descriptions through GA techniques. We recognize rule optimization as an emerging research area in the creation of autonomous intelligent systems. Two issues appear to be important for this research. First, rule optimization is absolutely necessary in problems where the system must be protected against the accumulation of noisy components and where the attributes used to describe the initial data have complex, non-normal distributions. The method can be augmented by
optimizing input data formation (searching for the best subset of attributes) prior to rule generation [Vafaie and De Jong, 1991]. Secondly, for inductively generated rules (falsity-preserving rules) there should be some method of validating these rules (or choosing the most preferred inductive rule) before applying them to future data. The method presented in this paper tries to offer some solutions to these problems. The presented experiments on applying this hybrid method to finer-grained symbolic representations of vision data are encouraging. Highly disjunctive descriptions obtained by inductive learning from this data were easily manipulated by the chosen GA operators, and a substantial increase in the overall system performance was observed within a few generations. Although there exist substantial differences between GAs and symbolic induction methods (learning algorithm, performance elements, knowledge representation), this method serves as a promising example that a better understanding of the abilities of each approach (identifying common building blocks) can lead to novel and useful ways of combining them. This paper concentrates primarily on the proposed methodology. More experiments are needed to draw concrete conclusions. Future research will attempt to thoroughly test this approach on numerous data sets. We will also examine the mutation and crossover operators more thoroughly.

Acknowledgement: This research was supported in part by the Office of Naval Research under grants No. N00014-88-K-0397 and No. N00014-88-K-0226, and in part by the Defense Advanced Research Projects Agency under a grant administered by the Office of Naval Research, No. N00014-K. The authors wish to thank Janet Holmes and Haleh Vafaie for editing suggestions.

References

Bala, J.W.
and Pachowicz, P.W., "Application of Symbolic Machine Learning to the Recognition of Texture Concepts", 7th IEEE Conference on Artificial Intelligence Applications, Miami Beach, FL, February 1991.

De Jong, K., "Learning with Genetic Algorithms: An Overview", Machine Learning, vol. 3, Kluwer Academic Publishers, 1988.

Vafaie, H. and De Jong, K., "Improving the Performance of a Rule Induction System Using Genetic Algorithms", in preparation (Center for AI, GMU, 1991).

Fitzpatrick, J.M. and Grefenstette, J., "Genetic Algorithms in Noisy Environments", Machine Learning, vol. 3, 1988.

Michalski, R.S., "AQVAL/1 -- Computer Implementation of a Variable-Valued Logic System VL1 and Examples of Its Application to Pattern Recognition", First International Joint Conference on Pattern Recognition, October 30, 1973, Washington, D.C.

Michalski, R.S., Mozetic, I., Hong, J.R., Lavrac, N., "The AQ15 Inductive Learning System", Report No. UIUCDCS-R, Department of Computer Science, University of Illinois at Urbana-Champaign, July 1986.
More informationMachine Learning 2010
Machine Learning 2010 Concept Learning: The Logical Approach Michael M Richter Email: mrichter@ucalgary.ca 1 - Part 1 Basic Concepts and Representation Languages 2 - Why Concept Learning? Concepts describe
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationIntroduction to machine learning. Concept learning. Design of a learning system. Designing a learning system
Introduction to machine learning Concept learning Maria Simi, 2011/2012 Machine Learning, Tom Mitchell Mc Graw-Hill International Editions, 1997 (Cap 1, 2). Introduction to machine learning When appropriate
More informationURBAN LAND COVER AND LAND USE CLASSIFICATION USING HIGH SPATIAL RESOLUTION IMAGES AND SPATIAL METRICS
URBAN LAND COVER AND LAND USE CLASSIFICATION USING HIGH SPATIAL RESOLUTION IMAGES AND SPATIAL METRICS Ivan Lizarazo Universidad Distrital, Department of Cadastral Engineering, Bogota, Colombia; ilizarazo@udistrital.edu.co
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Fall 2015 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
More informationStructure Design of Neural Networks Using Genetic Algorithms
Structure Design of Neural Networks Using Genetic Algorithms Satoshi Mizuta Takashi Sato Demelo Lao Masami Ikeda Toshio Shimizu Department of Electronic and Information System Engineering, Faculty of Science
More informationInference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples
Published in: Journal of Global Optimization, 5, pp. 69-9, 199. Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples Evangelos Triantaphyllou Assistant
More informationA Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files
A Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files Pedro A. Diaz-Gomez and Dean F. Hougen Robotics, Evolution, Adaptation, and Learning Laboratory (REAL Lab) School of Computer
More informationMachine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.
10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the
More informationIndividual learning and population evolution
Genetic Algorithms [Read Chapter 9] [Exercises 9.1, 9.2, 9.3, 9.4] Evolutionary computation Prototypical GA An example: GABIL Genetic Programming Individual learning and population evolution 168 lecture
More informationBayesian Reasoning and Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG 2 / osig 1 Second Semester 2013/2014 Lesson 12 28 arch 2014 Bayesian Reasoning and Recognition Notation...2 Pattern Recognition...3
More informationRelation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets
Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets Hisao Ishibuchi, Isao Kuwajima, and Yusuke Nojima Department of Computer Science and Intelligent Systems, Osaka Prefecture
More informationAdvanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland
EnviroInfo 2004 (Geneva) Sh@ring EnviroInfo 2004 Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland Mikhail Kanevski 1, Michel Maignan 1
More informationMachine Learning 2010
Machine Learning 2010 Decision Trees Email: mrichter@ucalgary.ca -- 1 - Part 1 General -- 2 - Representation with Decision Trees (1) Examples are attribute-value vectors Representation of concepts by labeled
More informationEvolutionary Functional Link Interval Type-2 Fuzzy Neural System for Exchange Rate Prediction
Evolutionary Functional Link Interval Type-2 Fuzzy Neural System for Exchange Rate Prediction 3. Introduction Currency exchange rate is an important element in international finance. It is one of the chaotic,
More informationApplication of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data
Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer
More informationPerformance Evaluation and Comparison
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation
More informationData Mining and Machine Learning (Machine Learning: Symbolische Ansätze)
Data Mining and Machine Learning (Machine Learning: Symbolische Ansätze) Learning Individual Rules and Subgroup Discovery Introduction Batch Learning Terminology Coverage Spaces Descriptive vs. Predictive
More informationAn Improved Quantum Evolutionary Algorithm with 2-Crossovers
An Improved Quantum Evolutionary Algorithm with 2-Crossovers Zhihui Xing 1, Haibin Duan 1,2, and Chunfang Xu 1 1 School of Automation Science and Electrical Engineering, Beihang University, Beijing, 100191,
More informationMetaheuristics and Local Search. Discrete optimization problems. Solution approaches
Discrete Mathematics for Bioinformatics WS 07/08, G. W. Klau, 31. Januar 2008, 11:55 1 Metaheuristics and Local Search Discrete optimization problems Variables x 1,...,x n. Variable domains D 1,...,D n,
More informationConcept Learning. Space of Versions of Concepts Learned
Concept Learning Space of Versions of Concepts Learned 1 A Concept Learning Task Target concept: Days on which Aldo enjoys his favorite water sport Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
More informationPattern Recognition System with Top-Down Process of Mental Rotation
Pattern Recognition System with Top-Down Process of Mental Rotation Shunji Satoh 1, Hirotomo Aso 1, Shogo Miyake 2, and Jousuke Kuroiwa 3 1 Department of Electrical Communications, Tohoku University Aoba-yama05,
More informationFinal Exam, Fall 2002
15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work
More informationLecture Notes in Machine Learning Chapter 4: Version space learning
Lecture Notes in Machine Learning Chapter 4: Version space learning Zdravko Markov February 17, 2004 Let us consider an example. We shall use an attribute-value language for both the examples and the hypotheses
More informationA Genetic Algorithm with Expansion and Exploration Operators for the Maximum Satisfiability Problem
Applied Mathematical Sciences, Vol. 7, 2013, no. 24, 1183-1190 HIKARI Ltd, www.m-hikari.com A Genetic Algorithm with Expansion and Exploration Operators for the Maximum Satisfiability Problem Anna Gorbenko
More informationAn Approach to Classification Based on Fuzzy Association Rules
An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More informationEnsemble determination using the TOPSIS decision support system in multi-objective evolutionary neural network classifiers
Ensemble determination using the TOPSIS decision support system in multi-obective evolutionary neural network classifiers M. Cruz-Ramírez, J.C. Fernández, J. Sánchez-Monedero, F. Fernández-Navarro, C.
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationForecasting & Futurism
Article from: Forecasting & Futurism December 2013 Issue 8 A NEAT Approach to Neural Network Structure By Jeff Heaton Jeff Heaton Neural networks are a mainstay of artificial intelligence. These machine-learning
More informationSelection of Classifiers based on Multiple Classifier Behaviour
Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,
More informationComputational Learning Theory. Definitions
Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational
More informationOn Improving the k-means Algorithm to Classify Unclassified Patterns
On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,
More informationMaking Our Cities Safer: A Study In Neighbhorhood Crime Patterns
Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Aly Kane alykane@stanford.edu Ariel Sagalovsky asagalov@stanford.edu Abstract Equipped with an understanding of the factors that influence
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationA Genetic Inspired Optimization for ECOC
A Genetic Inspired Optimization for ECOC Miguel Ángel Bautista1,2, Sergio Escalera 1,2, Xavier Baró 2,3, and Oriol Pujol 1,2 1 Dept. Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Gran Via 585,
More informationImproved TBL algorithm for learning context-free grammar
Proceedings of the International Multiconference on ISSN 1896-7094 Computer Science and Information Technology, pp. 267 274 2007 PIPS Improved TBL algorithm for learning context-free grammar Marcin Jaworski
More informationCondition Monitoring of Single Point Cutting Tool through Vibration Signals using Decision Tree Algorithm
Columbia International Publishing Journal of Vibration Analysis, Measurement, and Control doi:10.7726/jvamc.2015.1003 Research Paper Condition Monitoring of Single Point Cutting Tool through Vibration
More informationText Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)
Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed
More informationPredictive analysis on Multivariate, Time Series datasets using Shapelets
1 Predictive analysis on Multivariate, Time Series datasets using Shapelets Hemal Thakkar Department of Computer Science, Stanford University hemal@stanford.edu hemal.tt@gmail.com Abstract Multivariate,
More informationData Mining and Machine Learning
Data Mining and Machine Learning Concept Learning and Version Spaces Introduction Concept Learning Generality Relations Refinement Operators Structured Hypothesis Spaces Simple algorithms Find-S Find-G
More informationSubcellular Localisation of Proteins in Living Cells Using a Genetic Algorithm and an Incremental Neural Network
Subcellular Localisation of Proteins in Living Cells Using a Genetic Algorithm and an Incremental Neural Network Marko Tscherepanow and Franz Kummert Applied Computer Science, Faculty of Technology, Bielefeld
More informationReview of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations
Review of Lecture 1 This course is about finding novel actionable patterns in data. We can divide data mining algorithms (and the patterns they find) into five groups Across records Classification, Clustering,
More informationStochastic Search: Part 2. Genetic Algorithms. Vincent A. Cicirello. Robotics Institute. Carnegie Mellon University
Stochastic Search: Part 2 Genetic Algorithms Vincent A. Cicirello Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 cicirello@ri.cmu.edu 1 The Genetic Algorithm (GA)
More informationNEUROEVOLUTION. Contents. Evolutionary Computation. Neuroevolution. Types of neuro-evolution algorithms
Contents Evolutionary Computation overview NEUROEVOLUTION Presenter: Vlad Chiriacescu Neuroevolution overview Issues in standard Evolutionary Computation NEAT method Complexification in competitive coevolution
More informationEvolutionary Computation: introduction
Evolutionary Computation: introduction Dirk Thierens Universiteit Utrecht The Netherlands Dirk Thierens (Universiteit Utrecht) EC Introduction 1 / 42 What? Evolutionary Computation Evolutionary Computation
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationCOMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization
: Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage
More informationEvolutionary Computation. DEIS-Cesena Alma Mater Studiorum Università di Bologna Cesena (Italia)
Evolutionary Computation DEIS-Cesena Alma Mater Studiorum Università di Bologna Cesena (Italia) andrea.roli@unibo.it Evolutionary Computation Inspiring principle: theory of natural selection Species face
More informationExamination Artificial Intelligence Module Intelligent Interaction Design December 2014
Examination Artificial Intelligence Module Intelligent Interaction Design December 2014 Introduction This exam is closed book, you may only use a simple calculator (addition, substraction, multiplication
More informationCS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón
CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search
More informationCptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1
CptS 570 Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1 IEEE Expert, October 1996 CptS 570 - Machine Learning 2 Given sample S from all possible examples D Learner
More informationSearching the Subsumption Lattice by a Genetic Algorithm
Searching the Subsumption Lattice by a Genetic Algorithm Alireza Tamaddoni-Nezhad and Stephen H. Muggleton Department of Computer Science University of York, York, YO1 5DD, UK {alireza,stephen}@cs.york.ac.uk
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationA Note on Crossover with Interval Representations
A Note on Crossover with Interval Representations Christopher Stone and Larry Bull UWE Learning Classifier System Technical Report UWELCSG03-00 Faculty of Computing, Engineering and Mathematical Sciences
More informationA Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe
More informationA new Approach to Drawing Conclusions from Data A Rough Set Perspective
Motto: Let the data speak for themselves R.A. Fisher A new Approach to Drawing Conclusions from Data A Rough et Perspective Zdzisław Pawlak Institute for Theoretical and Applied Informatics Polish Academy
More informationPattern Recognition Approaches to Solving Combinatorial Problems in Free Groups
Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies
More informationData Mining und Maschinelles Lernen
Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting
More informationCS 231A Section 1: Linear Algebra & Probability Review
CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability
More information