Classification Using Decision Trees

1. Introduction

The term data mining is mainly used for a specific set of six activities, namely classification, estimation, prediction, affinity grouping or association rules, clustering, and description and visualization. The first three tasks, classification, estimation and prediction, are all examples of directed data mining or supervised learning. The Decision Tree (DT) is one of the most popular choices for learning from feature-based examples. It has undergone a number of alterations to deal with language, memory requirements and efficiency considerations.

A DT is a classification scheme which generates a tree and a set of rules, representing the model of the different classes, from a given dataset. As per Han and Kamber [HK01], a DT is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node. Figure 1 shows the DT induced for the dataset in Table 1. We can easily derive the rules corresponding to the tree by traversing the path from the root node to each leaf. Many different leaves of the tree may refer to the same class label, but each leaf corresponds to a different rule. DTs are attractive in data mining as they represent rules which can readily be expressed in natural language.

The major strengths of DT methods are the following:
1. DTs are able to generate understandable rules.
2. They are able to handle both numerical and categorical attributes.
3. They provide a clear indication of which fields are most important for prediction or classification.

Some of the weaknesses of DTs are:
1. Some DTs can only deal with binary-valued target classes; others are able to assign records to an arbitrary number of classes, but are error-prone when the number of training examples per class gets small. This can happen rather quickly in a tree with many levels and many branches per node.
2. The process of growing a DT is computationally expensive. At each node, each candidate splitting field must be examined before its best split can be found.

Many classification models have been proposed in the literature. Classification trees, also called decision trees, are especially attractive in a data mining environment for several reasons. First, due to their intuitive representation, the resulting classification model is easy for humans to assimilate [BFOS84, SAM96]. Second, decision trees do not require any parameter setting from the user and thus are especially suited for exploratory knowledge discovery. Third, DTs can be constructed relatively fast, and their accuracy is comparable or superior to that of other classification models.
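The tree-and-rules representation just described can be made concrete with a small sketch. The Python below is purely illustrative (the Leaf and Node classes and the rules function are our own names, not something defined in this chapter): internal nodes hold an attribute test with one child per outcome, leaves hold class labels, and one rule per leaf is derived by a root-to-leaf traversal, using the tree of Figure 1 as the example.

    # Minimal sketch of the tree structure described above; all names are illustrative.
    class Leaf:
        def __init__(self, label):
            self.label = label          # class label, e.g. "yes" / "no"

    class Node:
        def __init__(self, attribute, children):
            self.attribute = attribute  # attribute tested at this internal node
            self.children = children    # dict: outcome value -> Leaf or Node

    def rules(tree, conditions=()):
        """Yield one (conditions, class) rule per leaf, via root-to-leaf traversal."""
        if isinstance(tree, Leaf):
            yield list(conditions), tree.label
        else:
            for value, subtree in tree.children.items():
                yield from rules(subtree, conditions + ((tree.attribute, value),))

    # The tree of Figure 1, written out by hand for illustration.
    figure1 = Node("Outlook", {
        "sunny":    Node("Humidity", {"high": Leaf("no"), "normal": Leaf("yes")}),
        "overcast": Leaf("yes"),
        "rain":     Node("Wind", {"weak": Leaf("yes"), "strong": Leaf("no")}),
    })

    for conds, label in rules(figure1):
        print(" and ".join(f"{a}={v}" for a, v in conds), "=> play =", label)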

2. Decision Tree Induction

DT induction is a well-defined technique in statistics and machine learning. A common basic principle underlying all DT induction algorithms is outlined below.

2.1 Basic Principle (Hunt's method)

All DT induction algorithms follow the basic principle, known as CLS (Concept Learning System), given by Hunt [HH61, HMS66]. Ross Quinlan describes his work on trees (ID3, C4.5) as a furtherance of Hunt's CLS ideas. CLS tries to mimic the human process of learning a concept, starting with examples from two classes and then inducing a rule to distinguish the two classes based on the other attributes. Let the training dataset be T with class labels {C1, C2, ..., Ci}. The decision tree is built by repeatedly partitioning the training data using some splitting criterion until all the records in a partition belong to the same class. The steps to be followed are:
1. If T contains no cases (T is trivial), the decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
2. If T contains cases all belonging to a single class Cj (T is homogeneous), the corresponding tree is a leaf identifying class Cj.
3. If T is not homogeneous, a test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, ..., On}. T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all those cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree-building method is applied recursively to each subset of training cases.

2.2 Measures of Diversity

The diversity index is a well-developed topic with different names in different fields. To the statistical biologist, it is the Simpson diversity index. To cryptographers, it is one minus the repeat rate. To econometricians, it is the Gini index, which is also used by the developers of the CART algorithm. Quinlan [Qui87, Qui93] used entropy as devised by Shannon in information theory [Sha48]. A high index of diversity indicates that the set contains an even distribution of classes, whereas a low index means that members of a single class predominate. The best splitter is the one that decreases the diversity of the record sets by the greatest amount. The three common diversity functions are discussed here. Let S be a training dataset for a classification problem with C classes, and let p(I) denote the proportion of S belonging to class I, where I varies from 1 to C.

Simple diversity index = \min_I p(I)    (1)

Entropy provides an information-theoretic approach to measure the goodness of a split. It measures the amount of information in an attribute.

Entropy(S) = - \sum_{I=1}^{C} p(I) \log_2 p(I)    (2)
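As a rough illustration of the two measures above, the following Python sketch (our own helper names, not part of the chapter) computes the simple diversity index of equation (1) and the entropy of equation (2) from a list of class counts; both are largest for an even class distribution and small when a single class predominates.

    import math

    def proportions(counts):
        """Class proportions p(I) from a list of class counts."""
        total = sum(counts)
        return [c / total for c in counts]

    def simple_diversity(counts):
        """Equation (1): the smallest class proportion."""
        return min(proportions(counts))

    def entropy(counts):
        """Equation (2): -sum p(I) log2 p(I), treating 0 log 0 as 0."""
        return -sum(p * math.log2(p) for p in proportions(counts) if p > 0)

    print(round(entropy([7, 7]), 3), round(simple_diversity([7, 7]), 3))    # 1.0 0.5 : even split, maximum diversity
    print(round(entropy([13, 1]), 3), round(simple_diversity([13, 1]), 3))  # 0.371 0.071 : one class predominates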

Gain(S, A), the information gain of the example set S on an attribute A, is defined as

Gain(S, A) = Entropy(S) - \sum_{V \in Values(A)} (|S_V| / |S|) Entropy(S_V)    (3)

where the sum is over each value V of the attribute A, S_V is the subset of S for which attribute A has value V, |S_V| is the number of elements in S_V, and |S| is the number of elements in S. The above notion of gain tends to favour attributes that have a larger number of values. To compensate for this, Quinlan [Qui87] suggests using the gain ratio instead of the gain, as formulated below:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)    (4)

where SplitInfo(S, A) is the information due to the split of S on the basis of the values of the categorical attribute A, i.e. the entropy of the partition of S induced by the values of attribute A.

The Gini index measures the diversity of a population using the formula

Gini index = 1 - \sum_{I=1}^{C} p(I)^2    (5)

where p(I) is the proportion of S belonging to class I. In this thesis, entropy, information gain and gain ratio (equations (2), (3) and (4)) have been used to determine the best split. All of the above functions attain their maximum when the probabilities of the classes are equal and evaluate to zero when the set contains only a single class. Between the two extremes, these functions have slightly different shapes and as a result produce slightly different rankings of the proposed splits.

2.3 ID3 Algorithm

Many DT induction algorithms have been developed over the years, and they more or less follow the basic principle discussed above. A discussion of all these algorithms is beyond the scope of the present chapter. Quinlan [Qui93] introduced the Iterative Dichotomizer 3 (ID3) for constructing a decision tree from data. In ID3, each node corresponds to a splitting attribute and each branch is a possible value of that attribute. At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered on the path from the root. The algorithm uses the criterion of information gain to determine the goodness of a split. The attribute with the greatest information gain is taken as the splitting attribute, and the dataset is split for each distinct value of that attribute.
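To make equations (2)-(4) and the ID3 splitting criterion concrete, here is an illustrative Python sketch (the function names and the pair-based input format are our own choices, not from the chapter) that computes information gain and gain ratio for a categorical attribute from (attribute value, class label) pairs.

    import math
    from collections import Counter

    def entropy(labels):
        """Equation (2): -sum p(I) log2 p(I) over the class labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

    def information_gain(pairs):
        """Equation (3): pairs is a list of (attribute value, class label)."""
        labels = [label for _, label in pairs]
        gain = entropy(labels)
        for value in set(v for v, _ in pairs):
            subset = [label for v, label in pairs if v == value]
            gain -= (len(subset) / len(pairs)) * entropy(subset)
        return gain

    def gain_ratio(pairs):
        """Equation (4): information gain divided by the split information."""
        split_info = entropy([v for v, _ in pairs])  # entropy of the partition by attribute value
        return information_gain(pairs) / split_info if split_info > 0 else 0.0

    # Tiny example: the Wind column of the Weather dataset against Play.
    wind = [("weak", "yes")] * 6 + [("weak", "no")] * 2 + [("strong", "yes")] * 3 + [("strong", "no")] * 3
    print(round(information_gain(wind), 3))  # 0.048, as computed in Example 1 below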

3. Illustration

3.1 Description of the Data Set

Table 1: Weather Dataset

ID    Outlook    Temperature  Humidity  Wind    Play
X1    sunny      hot          high      weak    no
X2    sunny      hot          high      strong  no
X3    overcast   hot          high      weak    yes
X4    rain       mild         high      weak    yes
X5    rain       cool         normal    weak    yes
X6    rain       cool         normal    strong  no
X7    overcast   cool         normal    strong  yes
X8    sunny      mild         high      weak    no
X9    sunny      cool         normal    weak    yes
X10   rain       mild         normal    weak    yes
X11   sunny      mild         normal    strong  yes
X12   overcast   mild         high      strong  yes
X13   overcast   hot          normal    weak    yes
X14   rain       mild         high      strong  no

The Weather dataset (Table 1) is a small, entirely fictitious dataset. It supposedly concerns the conditions that are suitable for playing some unspecified game. The condition attributes of the dataset are {Outlook, Temperature, Humidity, Wind} and the decision attribute is Play, denoting whether or not to play. In its simplest form, all four attributes have categorical values. The values of the attributes Outlook, Temperature, Humidity and Wind are in {sunny, overcast, rain}, {hot, mild, cool}, {high, normal} and {weak, strong} respectively. The values of the four attributes give 3*3*2*2 = 36 possible combinations, of which 14 are present as input.
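For convenience in experimenting with the example that follows, Table 1 can be encoded as a small list of records. The sketch below is just one possible encoding (the variable name weather and the tuple layout are our own); it also reproduces the class counts used in Example 1.

    from collections import Counter

    # Table 1, encoded as (Outlook, Temperature, Humidity, Wind, Play) tuples.
    weather = [
        ("sunny",    "hot",  "high",   "weak",   "no"),   # X1
        ("sunny",    "hot",  "high",   "strong", "no"),   # X2
        ("overcast", "hot",  "high",   "weak",   "yes"),  # X3
        ("rain",     "mild", "high",   "weak",   "yes"),  # X4
        ("rain",     "cool", "normal", "weak",   "yes"),  # X5
        ("rain",     "cool", "normal", "strong", "no"),   # X6
        ("overcast", "cool", "normal", "strong", "yes"),  # X7
        ("sunny",    "mild", "high",   "weak",   "no"),   # X8
        ("sunny",    "cool", "normal", "weak",   "yes"),  # X9
        ("rain",     "mild", "normal", "weak",   "yes"),  # X10
        ("sunny",    "mild", "normal", "strong", "yes"),  # X11
        ("overcast", "mild", "high",   "strong", "yes"),  # X12
        ("overcast", "hot",  "normal", "weak",   "yes"),  # X13
        ("rain",     "mild", "high",   "strong", "no"),   # X14
    ]

    print(Counter(row[-1] for row in weather))  # Counter({'yes': 9, 'no': 5})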

3.2 Example 1

To induce a DT for Table 1 using the ID3 algorithm, note that S is a collection of 14 examples with 9 yes and 5 no examples. Using equation (2),

Entropy(S) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940

There are 4 condition attributes in this table. In order to find the attribute that can serve as the root node of the decision tree to be induced, the information gain is calculated for each of the 4 attributes. For attribute Wind, there are 8 occurrences of Wind = weak and 6 occurrences of Wind = strong. For Wind = weak, 6 out of the 8 examples are yes and the remaining 2 are no. For Wind = strong, 3 out of the 6 examples are yes and the remaining 3 are no. Therefore,

Entropy(S_weak) = -(6/8) \log_2(6/8) - (2/8) \log_2(2/8) = 0.811
Entropy(S_strong) = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1.0

and, using equation (3),

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Similarly,

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151

Outlook has the highest gain, so it is used as the decision attribute in the root node. Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is which of the remaining attributes should be tested at the branch node sunny. S_sunny = {X1, X2, X8, X9, X11}, i.e. the 5 examples from Table 1 with Outlook = sunny. Using equation (3),

Gain(S_sunny, Humidity) = 0.970
Gain(S_sunny, Temperature) = 0.570
Gain(S_sunny, Wind) = 0.019

These calculations show that the attribute Humidity has the highest gain; therefore, it should be used as the decision node for the branch sunny. This process is repeated until all data are classified perfectly or no attribute is left for the child nodes further down the tree. The final decision tree obtained using ID3 is shown in Figure 1. The corresponding rules are:
1. If outlook = sunny and humidity = high then play = no
2. If outlook = sunny and humidity = normal then play = yes
3. If outlook = overcast then play = yes
4. If outlook = rain and wind = weak then play = yes
5. If outlook = rain and wind = strong then play = no
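The hand calculations of Example 1 can be checked in a few lines of Python. The sketch below is only an illustrative verification using the class counts quoted above (9/5 overall, 6/2 for Wind = weak, 3/3 for Wind = strong).

    import math

    def entropy(counts):
        """-sum p log2 p over the given class counts (equation (2))."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    e_s      = entropy([9, 5])   # whole dataset: 9 yes, 5 no   -> ~0.940
    e_weak   = entropy([6, 2])   # Wind = weak:   6 yes, 2 no   -> ~0.811
    e_strong = entropy([3, 3])   # Wind = strong: 3 yes, 3 no   -> 1.0

    gain_wind = e_s - (8 / 14) * e_weak - (6 / 14) * e_strong
    print(round(e_s, 3), round(gain_wind, 3))  # 0.94 0.048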

[Figure 1: ID3 classifier for the Weather dataset. The root node tests Outlook; the sunny branch leads to a test on Humidity (high -> no, normal -> yes), the overcast branch leads directly to the leaf yes, and the rain branch leads to a test on Wind (weak -> yes, strong -> no).]

4. Summary

Notions and algorithms for DT induction have been discussed, with special reference to the ID3 algorithm. A small hypothetical dataset has been used to illustrate the construction of a decision tree using the ID3 algorithm.

References

[BFOS84] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J., Classification and Regression Trees, Wadsworth, 1984.
[HK01] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[Qui87] Quinlan, J. R., Simplifying Decision Trees, International Journal of Man-Machine Studies, 27:221-234, 1987.
[Qui93] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[SAM96] Shafer, J., Agrawal, R. and Mehta, M., SPRINT: A Scalable Parallel Classifier for Data Mining, In Proc. 22nd Int. Conf. on Very Large Data Bases (VLDB), 1996.
[Sha48] Shannon, C., A Mathematical Theory of Communication, The Bell System Technical Journal, 27:379-423, 623-656, 1948.