Rule Generation using Decision Trees

Dr. Rajni Jain

1. Introduction

A decision tree (DT) is a classification scheme that generates a tree and a set of rules, representing the model of the different classes, from a given dataset. As per Han and Kamber [HK01], a DT is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. The topmost node in a tree is the root node. Figure 1 shows the DT induced from the dataset in Table 1. The rules corresponding to the tree can be derived easily by traversing the path from the root node to each leaf. Many different leaves of the tree may refer to the same class label, but each leaf corresponds to a different rule. DTs are attractive in data mining because the rules they represent can readily be expressed in natural language. The major strengths of DT methods are the following:

1. DTs are able to generate understandable rules.
2. They are able to handle both numerical and categorical attributes.
3. They provide a clear indication of which fields are most important for prediction or classification.

2. Decision Tree Induction

DT induction is a well-defined technique in statistics and machine learning. A basic principle common to all DT induction algorithms is outlined below.

2.1 Basic Principle (Hunt's Method)

All DT induction algorithms follow the basic principle, known as CLS (Concept Learning System), given by Hunt [HH61, HMS66]. Ross Quinlan describes his work on trees (ID3, C4.5) as a furtherance of Hunt's CLS ideas. CLS tries to mimic the human process of learning a concept, starting with examples from two classes and then inducing a rule to distinguish the two classes on the basis of the other attributes. Let the training dataset be T with class labels {C1, C2, ..., Ck}. The decision tree is built by repeatedly partitioning the training data using some splitting criterion until all the records in a partition belong to the same class. The steps to be followed are:

1. If T contains no cases (T is trivial), the decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
2. If T contains cases all belonging to a single class Cj (T is homogeneous), the corresponding tree is a leaf identifying class Cj.
3. If T is not homogeneous, a test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, ..., On}. T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all those cases in T that have the outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome. The same tree-building method is applied recursively to each subset of training cases.
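This recursion can be written down compactly. The following is a minimal Python sketch, assuming the training cases are stored as dictionaries and that choose_test stands in for whichever splitting criterion is used (such as those discussed in Section 2.2); it is an illustration of the principle, not a full implementation.

```python
from collections import Counter

def build_tree(records, attributes, target, choose_test, parent_majority=None):
    """records: list of dicts; attributes: candidate attribute names;
    target: name of the class attribute; choose_test: splitting criterion."""
    # Case 1: T contains no cases -- label the leaf from information other than T,
    # here the majority class of the parent partition.
    if not records:
        return {"leaf": parent_majority}
    classes = [r[target] for r in records]
    majority = Counter(classes).most_common(1)[0][0]
    # Case 2: T is homogeneous (or nothing is left to test) -- a leaf for that class.
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": majority}
    # Case 3: choose a test on a single attribute, partition T on its outcomes,
    # and build the subtrees recursively.
    attr = choose_test(records, attributes, target)
    node = {"test": attr, "branches": {}}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["branches"][value] = build_tree(subset, remaining, target,
                                             choose_test, parent_majority=majority)
    return node
```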

2.2 Measures of Diversity

The diversity index is a well-developed topic that goes by different names in different fields. To statistical biologists it is the Simpson diversity index; to cryptographers it is one minus the repeat rate; to econometricians it is the Gini index, which is also used by the developers of the CART algorithm. Quinlan [Qui87, Qui93] used entropy as devised by Shannon in information theory [Sha48]. A high index of diversity indicates that the set contains an even distribution of classes, whereas a low index means that members of a single class predominate. The best splitter is the one that decreases the diversity of the record sets by the greatest amount. Three common diversity functions are discussed here. Let S be the training dataset of a classification problem with C classes, and let p(I) denote the proportion of S belonging to class I, where I ranges from 1 to C.

Simple diversity index = min_I p(I)    (1)

Entropy provides an information-theoretic measure of the goodness of a split; it measures the amount of information carried by an attribute:

Entropy(S) = - Σ_{I=1}^{C} p(I) log2 p(I)    (2)

Gain(S, A), the information gain of the example set S on an attribute A, is defined as

Gain(S, A) = Entropy(S) - Σ_V (|S_V| / |S|) Entropy(S_V)    (3)

where the sum is over every value V of the attribute A, S_V is the subset of S for which attribute A has value V, |S_V| is the number of elements in S_V, and |S| is the number of elements in S. This notion of gain tends to favour attributes that have a larger number of values. To compensate for this, Quinlan [Qui87] suggests using the gain ratio instead of the gain, formulated as

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)    (4)

where SplitInfo(S, A) is the information due to the split of S on the basis of the values of the categorical attribute A, i.e. the entropy of the partition of S induced by the values of A. The Gini index measures the diversity of a population as

Gini index = 1 - Σ_{I=1}^{C} p(I)^2    (5)

where p(I) is again the proportion of S belonging to class I.

In this thesis, entropy, information gain and gain ratio (equations (2), (3) and (4)) have been used to determine the best split. All of the above functions attain their maximum when the class probabilities are equal and evaluate to zero when the set contains only a single class. Between these two extremes the functions have slightly different shapes and therefore produce slightly different rankings of the proposed splits.
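These measures translate directly into code. The following minimal Python sketch implements equations (2)-(5) for categorical attributes, assuming records are stored as dictionaries and p(I) is estimated from class frequencies.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = - sum_I p(I) log2 p(I), equation (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index = 1 - sum_I p(I)^2, equation (5)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_V (|S_V|/|S|) Entropy(S_V), equation (3)."""
    labels = [r[target] for r in records]
    gain = entropy(labels)
    for value in set(r[attribute] for r in records):
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)
    return gain

def gain_ratio(records, attribute, target):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), equation (4)."""
    split_info = entropy([r[attribute] for r in records])  # entropy of the partition induced by A
    if split_info == 0:
        return 0.0  # attribute has a single value; it carries no information
    return information_gain(records, attribute, target) / split_info
```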

2.3 The ID3 Algorithm

Many DT induction algorithms have been developed over the years, and they more or less follow the basic principle discussed above. A discussion of all of them is beyond the scope of the present chapter. Quinlan [Qui93] introduced the Iterative Dichotomizer 3 (ID3) for constructing a decision tree from data. In ID3, each node corresponds to a splitting attribute and each branch is a possible value of that attribute. At each node, the splitting attribute is selected as the most informative among the attributes not yet considered on the path from the root. The algorithm uses the criterion of information gain to judge the goodness of a split: the attribute with the greatest information gain is taken as the splitting attribute, and the dataset is split on all distinct values of that attribute.
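In code, ID3's selection rule amounts to plugging information gain into the generic Hunt skeleton; the short sketch below assumes the build_tree and information_gain helpers from the Section 2.1 and 2.2 sketches are in scope.

```python
def id3_choose_test(records, attributes, target):
    # ID3's criterion: the attribute with the greatest information gain.
    return max(attributes, key=lambda a: information_gain(records, a, target))

def id3(records, attributes, target):
    # Hunt's recursive partitioning with information gain as the splitting test.
    return build_tree(records, attributes, target, id3_choose_test)
```

Applied to the weather data of Section 3, this selection rule reproduces the tree shown in Figure 1.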

3. Illustration

3.1 Description of the Data Set

Table 1: Weather Dataset

ID    Outlook     Temperature   Humidity   Wind      Play
X1    sunny       hot           high       weak      no
X2    sunny       hot           high       strong    no
X3    overcast    hot           high       weak      yes
X4    rain        mild          high       weak      yes
X5    rain        cool          normal     weak      yes
X6    rain        cool          normal     strong    no
X7    overcast    cool          normal     strong    yes
X8    sunny       mild          high       weak      no
X9    sunny       cool          normal     weak      yes
X10   rain        mild          normal     weak      yes
X11   sunny       mild          normal     strong    yes
X12   overcast    mild          high       strong    yes
X13   overcast    hot           normal     weak      yes
X14   rain        mild          high       strong    no

The Weather dataset (Table 1) is a small, entirely fictitious dataset. It supposedly concerns the conditions that are suitable for playing some unspecified game. The condition attributes are Outlook, Temperature, Humidity and Wind, and the decision attribute is Play, denoting whether or not to play. In its simplest form, all four condition attributes take categorical values: Outlook takes values in {sunny, overcast, rain}, Temperature in {hot, mild, cool}, Humidity in {high, normal} and Wind in {weak, strong}. The values of the four attributes allow 3*3*2*2 = 36 possible combinations, of which 14 are present as input.

3.2 Example 1

To induce a DT from Table 1 using the ID3 algorithm, note that S is a collection of 14 examples with 9 yes and 5 no examples. Using equation (2),

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

There are 4 condition attributes in this table. To find the attribute that can serve as the root node of the decision tree, the information gain is calculated for each of the 4 attributes. For the attribute Wind, there are 8 occurrences of Wind = weak and 6 occurrences of Wind = strong. For Wind = weak, 6 of the 8 examples are yes and the remaining 2 are no; for Wind = strong, 3 of the 6 examples are yes and the remaining 3 are no. Hence

Entropy(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

Therefore, by equation (3),

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Similarly,

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151

Outlook has the highest gain and is therefore used as the decision attribute in the root node. Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is which of the remaining attributes should be tested at the branch node sunny. S_sunny = {X1, X2, X8, X9, X11}, i.e. there are 5 examples in Table 1 with Outlook = sunny. Using equation (3),

Gain(S_sunny, Humidity) = 0.970
Gain(S_sunny, Temperature) = 0.570
Gain(S_sunny, Wind) = 0.019
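The arithmetic above can be re-run with the helpers sketched in Section 2.2. The snippet below transcribes Table 1 into Python and recomputes the gains, assuming the entropy and information_gain sketches are in scope; the printed values agree with those above up to rounding in the last digit.

```python
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]
weather = [dict(zip(COLUMNS, row)) for row in ROWS]

print(f"Entropy(S) = {entropy([r['Play'] for r in weather]):.3f}")    # 0.940
for attr in COLUMNS[:-1]:
    print(f"Gain(S, {attr}) = {information_gain(weather, attr, 'Play'):.3f}")
# prints roughly Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048,
# so Outlook has the highest gain and becomes the root.

sunny = [r for r in weather if r["Outlook"] == "sunny"]
for attr in ["Temperature", "Humidity", "Wind"]:
    print(f"Gain(S_sunny, {attr}) = {information_gain(sunny, attr, 'Play'):.3f}")
# prints roughly Humidity 0.971, Temperature 0.571, Wind 0.020,
# so Humidity is tested next on the sunny branch.
```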

The above calculations show that the attribute Humidity has the highest gain on the sunny branch, so it is used as the next decision node below that branch. This process is repeated until all data are classified perfectly or no attribute is left for the child nodes further down the tree. The final decision tree obtained using ID3 is shown in Figure 1.

Figure 1: ID3 classifier for the Weather dataset (root node Outlook; the sunny branch tests Humidity, with high leading to no and normal to yes; the overcast branch is a leaf labelled yes; the rain branch tests Wind, with weak leading to yes and strong to no)

The corresponding rules are:

1. If outlook=sunny and humidity=high then play=no
2. If outlook=sunny and humidity=normal then play=yes
3. If outlook=overcast then play=yes
4. If outlook=rain and wind=weak then play=yes
5. If outlook=rain and wind=strong then play=no

4. Summary

Notions and algorithms for DT induction have been discussed, with special reference to the ID3 algorithm. A small hypothetical dataset was used to illustrate the construction of a decision tree with ID3. The decision tree can be mapped easily to generate rules.
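As a sketch of that mapping, the following helper walks the nested-dictionary trees produced by the earlier build_tree/id3 sketches and emits one rule per root-to-leaf path.

```python
def tree_to_rules(node, conditions=()):
    """Collect one rule per root-to-leaf path of a nested-dict tree."""
    if "leaf" in node:  # a leaf closes one rule
        body = " and ".join(f"{a}={v}" for a, v in conditions) or "always"
        return [f"If {body} then class={node['leaf']}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["test"], value),)))
    return rules

# e.g. tree_to_rules(id3(weather, ["Outlook", "Temperature", "Humidity", "Wind"], "Play"))
# yields rules equivalent to rules 1-5 above, up to ordering.
```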

References

[BFOS84] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J., Classification and Regression Trees, Wadsworth, 1984.

[HK01] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[Qui87] Quinlan, J. R., Simplifying Decision Trees, International Journal of Man-Machine Studies, 27:221-234, 1987.

[Qui93] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[SAM96] Shafer, J., Agrawal, R. and Mehta, M., SPRINT: A Scalable Parallel Classifier for Data Mining, in Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), 1996.

[Sha48] Shannon, C. E., A Mathematical Theory of Communication, The Bell System Technical Journal, 27:379-423, 623-656, 1948.