Machine Learning, Week 3

Topics: Entropy, Decision Trees, ID3, C4.5, Classification and Regression Trees (CART)

What is a Decision Tree?
In short, a decision tree is a data-classification procedure that works by asking questions in the right order.

Decision Tree Sample
[Figure: an example tree with Weather at the root (Sunny, Cloudy, Rainy branches), a Humidity node (High, Normal) and a Wind node (Strong, Weak), with Yes/No class labels at the leaves.]

Decision Trees
We start with entropy-based classification tree methods such as ID3 and C4.5. But first, we should learn something about entropy, uncertainty and information.

Entropy and Uncertainty
In probability theory, entropy is a measure of the uncertainty about a random variable. In information theory, (Shannon) entropy is the expected value (the average) of the information carried by each outcome of the random variable. According to Shannon, the proper choice for the "function to measure information" must be logarithmic.

Uncertainty and Information
Entropy and uncertainty are maximized when all outcomes of the random variable have the same probability. The difference between the prior and posterior probability distributions shows the amount of information gained or lost (Kullback-Leibler divergence). Accordingly, outcomes with maximum uncertainty are likely to yield the maximum information gain.

Information
In fact, information and uncertainty are like antonyms of each other. But since maximum uncertainty can yield maximum information, information gain and uncertainty focus on the same concept. Self-information is defined as

I(x) = log(1 / P(x)) = -log P(x)

Entropy
Shannon entropy (H) is the expected value of the self-information over all outcomes x_i with probabilities P_i:

H(X) = E[I(X)] = Σ_{i=1..n} P(x_i)·I(x_i) = Σ_{i=1..n} P(x_i)·log(1 / P(x_i)) = -Σ_{i=1..n} P_i·log P_i

Entropy
Let X be the random process of a coin toss. Since its two outcomes have the same probability, the entropy of X is found as

H(X) = -Σ_{i=1..2} p_i·log₂ p_i = -(0.5·log₂ 0.5 + 0.5·log₂ 0.5) = 1

We can interpret this as follows: after the toss, we gain 1 bit of information.

Entropy in Decision Trees
A decision tree divides multidimensional data into several subsets using a condition on a specified feature. At each step, deciding which feature to split on, and under which condition, requires solving a large combinatorial problem.
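To make the definition concrete, here is a minimal sketch in Python (not part of the original slides; the function name and the base-2 logarithm are my choices) that computes Shannon entropy for a discrete distribution and reproduces the 1-bit coin-toss result:

```python
import math

def entropy(probabilities):
    """Shannon entropy H(X) = -sum(p_i * log2(p_i)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Fair coin toss: two outcomes with equal probability -> 1 bit of information.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is less uncertain, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.47
```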

Entropy in Decision Trees
Even for a dataset with only 20 samples and 5 features, more than 10^6 decision trees can be constructed. Therefore, each branching should be performed methodically.

Entropy in Decision Trees
According to Quinlan, we should branch the tree so as to gain the maximum information. For this reason, we must find the feature that causes the minimum uncertainty. In ID3, all candidate branches are examined individually, and the feature causing the minimum uncertainty is preferred for branching the tree.

ID3 Algorithm
It works only with categorical data. In the first step of each iteration, the entropy of the dataset's class vector is computed. Then, for each feature, the entropy of the class conditioned on that feature is computed and subtracted from the entropy of the first step. The feature that yields the maximum information gain is chosen.

ID3 Sample
V1  V2  S
A   C   E
B   C   F
B   D   E
B   D   F

We have a dataset with two features (V1 and V2), a class vector (S), and 4 samples. Which feature gives the first branching?
H(S) - H(V1, S) = ?
H(S) - H(V2, S) = ?

ID3 Sample
V1  V2  S
A   C   E
B   C   F
B   D   E
B   D   F

First, the overall entropy:
H(S) = -(1/2)·log₂(1/2) - (1/2)·log₂(1/2) = 1

Entropy of S conditioned on V1:
H(V1, S) = (1/4)·H(A) + (3/4)·H(B) = (1/4)·0 + (3/4)·(-(1/3)·log₂(1/3) - (2/3)·log₂(2/3)) = (3/4)·0.9183 = 0.6887

Entropy of S conditioned on V2:
H(V2, S) = (1/2)·H(C) + (1/2)·H(D) = (1/2)·1 + (1/2)·1 = 1

V1 gives the larger information gain (1 - 0.6887 = 0.3113 versus 1 - 1 = 0), so we choose V1.
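The same example can be checked with a short Python sketch (illustrative only; the helper names entropy and conditional_entropy are mine, and H(V, S) from the slides corresponds to conditional_entropy below):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """Weighted entropy of the class within each value of the feature."""
    n = len(labels)
    groups = {}
    for f, s in zip(feature, labels):
        groups.setdefault(f, []).append(s)
    return sum(len(g) / n * entropy(g) for g in groups.values())

V1 = ['A', 'B', 'B', 'B']
V2 = ['C', 'C', 'D', 'D']
S  = ['E', 'F', 'E', 'F']

print(entropy(S))                               # 1.0
print(entropy(S) - conditional_entropy(V1, S))  # 0.3113 -> gain of V1
print(entropy(S) - conditional_entropy(V2, S))  # 0.0    -> gain of V2, so branch on V1
```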

C4.5 Algorithm
It is a variant of the ID3 algorithm that can be applied to numerical data. It transforms numerical features into categorical form by thresholding. Different threshold values are tested, and the threshold value that yields the best information gain is selected. After categorizing the feature vector with the selected threshold, ID3 can be applied directly.

C4.5 Algorithm
To determine the best threshold:
1. All numerical values are sorted (a1, a2, ..., an).
2. The average of each consecutive sorted pair is computed as (ai + ai+1)/2.
3. This yields n-1 averages to use as candidate thresholds.
4. Each candidate is tested as a branching point.
5. The candidate that yields the maximum information gain is selected.
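A minimal sketch of this threshold search (Python, not the course's MATLAB example; the sample humidity readings and the best_threshold name are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Test the midpoint of every consecutive pair of sorted values and
    return the threshold giving the highest information gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(n - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2        # average of a sorted pair
        left = [s for v, s in pairs if v <= t]
        right = [s for v, s in pairs if v > t]
        if not left or not right:
            continue
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Made-up humidity readings with a yes/no class label:
print(best_threshold([65, 70, 75, 80, 90], ['yes', 'yes', 'yes', 'no', 'no']))
# -> (77.5, ~0.97): splitting at 77.5 separates the two classes perfectly.
```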

Missing Data
If some feature values are missing, there are two ways to proceed:
1. Samples with missing features are removed from the dataset entirely.
2. The algorithm is adapted to operate with missing data.
If the number of affected samples is large, the second option should be applied.

Missing Data
When calculating the information gain for a feature vector with missing values, the gain is computed excluding the missing samples and then multiplied by the coefficient F, the fraction of samples whose value for that feature is known:

IG(X) = F · (H(X) - H(V, X))

Missing Data
Another proposed method is to write the most frequent value of the feature vector into the missing cells.
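A possible sketch of the scaled gain (Python; I assume missing values are marked as None, and F is taken as the fraction of samples with a known value):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_with_missing(feature, labels):
    """IG(X) = F * (H(X) - H(V, X)), computed only over samples whose
    feature value is known; F is the fraction of known (non-missing) values."""
    known = [(f, s) for f, s in zip(feature, labels) if f is not None]
    F = len(known) / len(labels)
    known_labels = [s for _, s in known]
    groups = {}
    for f, s in known:
        groups.setdefault(f, []).append(s)
    cond = sum(len(g) / len(known_labels) * entropy(g) for g in groups.values())
    return F * (entropy(known_labels) - cond)

# One missing value (None) out of four samples -> F = 0.75.
print(gain_with_missing(['A', 'B', None, 'B'], ['E', 'F', 'E', 'F']))  # ~0.69
```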

Overfitting
As in other machine learning methods, we should avoid overfitting in decision trees as well. If no precautions are taken, every decision tree will overfit. Therefore, trees should be pruned during or after their creation.

Pruning
Pruning is the process of removing from the decision tree the parts that are not needed for classification. In this way, the decision tree becomes both simpler and more understandable. There are two types of pruning methods: pre-pruning and post-pruning.

Pre-Pruning
Pre-pruning is done while the tree is being created. If the proportion of samples outside the dominant class at a node is less than a threshold value (the fault tolerance), the partitioning process is stopped. At that moment a leaf is created carrying the label of the dominant class of the data available at that node.

Post-Pruning
Post-pruning is done after the tree has been created, by replacing a sub-tree with a leaf, raising a sub-tree, or removing some branches.

Pruning Sample
[Figure: an example tree with Weather at the root (Sunny, Cloudy, Rainy); the Sunny branch leads to a Humidity node (High, Normal) whose leaves contain 35 'No' and 15 'Yes' samples, and the Rainy branch leads to a Wind node (Strong, Weak); sample counts are shown at the leaves.]

If the fault tolerance is 33%, the minority ratio in the sub-tree under the Humidity node is computed as 30% (15 of 50 samples). Therefore, instead of the Humidity node, a 'No' leaf is created.

Pruning Sample
[Figure: the pruned tree, in which the Humidity sub-tree under the Sunny branch has been replaced by a single 'No' leaf; the Cloudy and Rainy branches are unchanged.]
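A small sketch of the fault-tolerance check (Python; reading the counts from the figure as 35 'No' and 15 'Yes' samples under the Humidity node, and with the function name and dictionary layout chosen for illustration):

```python
def should_prune(class_counts, fault_tolerance=0.33):
    """Replace a sub-tree by a leaf of the dominant class when the share of
    samples NOT in the dominant class falls below the fault tolerance."""
    total = sum(class_counts.values())
    dominant = max(class_counts, key=class_counts.get)
    minority_ratio = 1 - class_counts[dominant] / total
    return minority_ratio < fault_tolerance, dominant

# Humidity sub-tree from the figure: 35 'No' vs 15 'Yes' -> minority ratio 30%.
print(should_prune({'No': 35, 'Yes': 15}))   # (True, 'No') -> replace by a 'No' leaf
```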

Classification and Regression Trees (CART)
Briefly, a CART tree divides each node into two branches (binary splits). The best-known splitting methods are the Twoing algorithm and the Gini algorithm.
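As a starting point for the Gini criterion, here is a minimal sketch of the Gini impurity and a CART-style binary split score (Python; function names and example labels are illustrative, not from the course material):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini impurity of a CART-style binary split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(['E', 'F', 'E', 'F']))            # 0.5 -> maximally mixed two-class node
print(gini_split(['E', 'E'], ['F', 'F']))    # 0.0 -> a pure (perfect) split
```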

MATLAB Application
>edit C4_5_ornek.m
Students should study this example using the given datasets.

Presentation Task
You can choose one of the two CART splitting algorithms: Twoing or Gini.