Decision Trees
CS57300 Data Mining, Fall 2016
Instructor: Bruno Ribeiro

Goal
} Classification without models (well, partially without a model)
} Today: decision trees


Why Trees?
} Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
} Model discrete outcomes nicely
} Can be very powerful; can be as complex as you need them
} C4.5 and CART: from top-10 entries on Kaggle, decision trees are very effective and popular

Sure, But Why Trees?
} Easy-to-understand knowledge representation
} Can handle mixed variables
} Recursive, divide-and-conquer learning method
} Efficient inference

} Top-down recursive divide-and-conquer algorithm:
  - Start with all examples at the root
  - Select the best attribute/feature
  - Grow the tree by splitting on attributes one by one; to determine which attribute to split on, look at node impurity
  - Recurse and repeat
} Other issues:
  - How to construct features
  - When to stop growing
  - Pruning irrelevant parts of the tree
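To make the recursion concrete, here is a minimal sketch of top-down induction in Python (not from the slides; all function names are illustrative, each example is assumed to be a dict mapping attribute names to values, and the entropy-based score it uses is defined formally a few slides later):

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy of the parent minus the weighted entropy of the children.
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        gain -= (len(idx) / n) * entropy([labels[i] for i in idx])
    return gain

def build_tree(rows, labels, attrs):
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node
```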

Example: predicting Fraud from Age, Degree, StartYr, Series7.

Fraud  Age  Degree  StartYr  Series7
  +     22    Y      2005      N
  -     25    N      2003      Y
  -     31    Y      1995      Y
  -     27    Y      1999      Y
  +     24    N      2006      N
  -     29    N      2003      N

Score each attribute split for these instances (Age, Degree, StartYr, Series7); choose the split on Series7. The Series7 = Y branch contains only negative examples (ages 25, 31, 27). For the Series7 = N branch (ages 22, 24, 29), score each attribute split for the remaining instances (Age, Degree, StartYr) and choose the split on Age > 28, which separates the negative example (age 29) from the two positive examples (ages 22 and 24).

Overview (with two features X1, X2 and a 1D target Y)

[FIGURE 9.2. Partitions and CART. The top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]

} Most well-known systems:
  - CART: Breiman, Friedman, Olshen and Stone
  - ID3, C4.5: Quinlan
} How do they differ?
  - Split scoring function
  - Stopping criterion
  - Pruning mechanism
  - Predictions in leaf nodes


} Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible... ideally into pure sets of "all positive" or "all negative"
[Figure: candidate splits on Patrons? (None/Some/Full) and on Type? (French/Italian/Thai/Burger).]
} Bias-variance tradeoff: choosing the most discriminating attribute first may not give the best tree (bias), but it can make the tree small (low variance)

Association between attribute and class label

Data (Income vs. class label):
  High: no, no, yes, yes
  Med:  yes, no, yes, yes, yes, no
  Low:  yes, no, yes, yes

Contingency table:
  Income   Buy   No buy
  High      2      2
  Med       4      2
  Low       3      1

Mathematically Defining a Good Split
} We start with information theory: how uncertain will the answer be if we split the tree this way?
} Say we need to decide among k options. The uncertainty in the answer $Y \in \{1, \dots, k\}$ when the probabilities are $p_1, \dots, p_k$ can be quantified via entropy:
$H(\langle p_1, \dots, p_k \rangle) = -\sum_i p_i \log_2 p_i$
} Convenient notation: $B(p) = H(\langle p, 1-p \rangle)$, the number of bits necessary to encode a Boolean outcome that is true with probability p
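As a quick numeric check of these definitions (a small sketch; function names are illustrative):

```python
import math

def entropy(probs):
    # H(<p_1, ..., p_k>) = -sum_i p_i log2 p_i, skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

def B(p):
    # B(p) = H(<p, 1-p>): bits needed for a Boolean outcome that is true with probability p
    return entropy([p, 1 - p])

print(entropy([0.5, 0.5]))  # 1.0: a fair coin is maximally uncertain
print(round(B(9 / 14), 3))  # 0.94: the buys_computer entropy used on later slides
```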

Amount of Information in the Tree
} Suppose we have p positive and n negative examples at the root: B(p/(p + n)) bits are needed to classify a new example
} Information is always conserved:
  - If encoding the information in the leaves is lossless, then the tree has a lossless encoding
  - The entropy of the leaves (bits) + the information carried in the tree structure (bits) = the total information in the data
} Let branch i of a split have $p_i$ positive and $n_i$ negative examples: $B(p_i/(p_i + n_i))$ bits are needed to classify a new example there. The expected number of bits per example over all branches is
$\sum_i \frac{p_i + n_i}{p + n}\, B\!\left(\frac{p_i}{p_i + n_i}\right)$
} Choose the next attribute to split on so as to minimize this remaining information, which maximizes the information in the tree (since information is conserved)

Information gain
} Information gain (Gain) is the amount of information that the tree structure encodes
} H[x] is the entropy: the expected number of bits needed to encode a randomly selected x
$\mathrm{Gain}(S, A) = H[S] - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H[S_v]$
H[buys_computer] = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.9400

$\mathrm{Gain}(S, A) = H[S] - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H[S_v]$

Income = high: no, no, yes, yes
Income = med:  yes, no, yes, yes, yes, no
Income = low:  yes, no, yes, yes

Entropy(Income=high) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Entropy(Income=med)  = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.9183
Entropy(Income=low)  = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113

Gain(D, Income) = 0.9400 - (4/14 × 1 + 6/14 × 0.9183 + 4/14 × 0.8113) = 0.029
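A short numeric check of this computation (a sketch; the per-branch class counts are read off the contingency table above):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# buys_computer: 9 "yes" and 5 "no" overall
h_root = entropy([9, 5])

# Class counts (yes, no) within each Income value
branches = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}
n_total = sum(sum(c) for c in branches.values())  # 14

remainder = sum((sum(c) / n_total) * entropy(c) for c in branches.values())
print(round(h_root, 3), round(h_root - remainder, 3))  # 0.94 0.029
```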

Gini gain
} Similar to information gain
} Uses the Gini index instead of entropy
} Measures the decrease in Gini index after the split:
$\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \mathrm{Gini}(S_v)$
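The same check with the Gini index in place of entropy (a sketch, reusing the Income counts):

```python
def gini(counts):
    # Gini index: 1 - sum over classes of (class proportion)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [9, 5]                                            # buys_computer: (yes, no)
branches = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}
n_total = sum(sum(c) for c in branches.values())         # 14

gain = gini(root) - sum((sum(c) / n_total) * gini(c) for c in branches.values())
print(round(gini(root), 4), round(gain, 4))              # 0.4592 0.0187
```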

[Plots: entropy and the Gini index as functions of the class probability, and entropy compared against Gini × 2.]
The Gini score can produce a larger gain.

Chi-square score
} Widely used to test independence between two categorical attributes (e.g., a feature and the class label)
} Considers the counts in a contingency table and calculates the normalized squared deviation of the observed (actual) counts from the counts expected under independence:
$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
} The sampling distribution is known to be chi-square distributed, provided the cell counts are above minimum thresholds

Contingency table with marginals:
  Income   Buy   No buy   Total
  High      2      2        4
  Med       4      2        6
  Low       3      1        4
  Total     9      5       14

$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$

Generic 2x2 contingency table (attribute value vs. class):
  Attribute   Class +   Class -
     0           a         b
     1           c         d
  (N = a + b + c + d)

Observed:
  Income   Buy   No buy
  High      2      2
  Med       4      2
  Low       3      1

Expected (under independence):
  Income   Buy    No buy
  High     2.57   1.43
  Med      3.86   2.14
  Low      2.57   1.43
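These expected counts and the resulting statistic can be reproduced with SciPy, assuming it is available (a sketch):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[2, 2],   # Income = High: (Buy, No buy)
                     [4, 2],   # Income = Med
                     [3, 1]])  # Income = Low

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(np.round(expected, 2))     # [[2.57 1.43] [3.86 2.14] [2.57 1.43]]
print(round(chi2_stat, 2), dof)  # about 0.57 with dof = 2: not significant
```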

Tree learning
} Top-down recursive divide-and-conquer algorithm:
  - Start with all examples at the root
  - Select the best attribute/feature
  - Partition the examples by the selected attribute
  - Recurse and repeat
} Other issues:
  - How to construct features
  - When to stop growing
  - Pruning irrelevant parts of the tree

Overfitting
} Consider a distribution D representing a population and a sample $D_S$ drawn from D, which is used as training data
} Given a model space M, a score function S, and a learning algorithm that returns a model $m \in M$, the algorithm overfits the training data $D_S$ if:
$\exists\, m' \in M$ such that $S(m, D_S) > S(m', D_S)$ but $S(m, D) < S(m', D)$
} In other words, there is another model (m') that is better on the entire distribution, and if we had learned from the full data we would have selected it instead

Knowledge representation: if-then rules
Example rule: if x > 25 then + else -
What is the model space? All possible thresholds.
What score function? Prediction error rate.
[Figure: examples labeled + and - plotted along the x axis.]

Search procedure? Try all thresholds and select the one with the lowest score.
[Figure: score as a function of the threshold for a small sample, a biased sample, a large sample, and the full data; the threshold chosen on the small or biased sample differs from the best result on the full data, illustrating overfitting.]

Approaches to avoid overfitting
} Regularization (priors), e.g., a Dirichlet prior in naive Bayes (NBC)
} Hold out an evaluation set used to adjust the structure of the learned model, e.g., pruning in decision trees
} Statistical tests during learning, to only include structure with significant associations, e.g., pre-pruning in decision trees
} Penalty term in the classifier scoring function, i.e., change the score function to prefer simpler models

How to avoid overfitting in decision trees
} Post-pruning: use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown)
} Pre-pruning:
  - Apply a statistical test to decide whether to expand a node
  - Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)

Algorithm comparison
} CART
  - Evaluation criterion: Gini index
  - Search algorithm: simple-to-complex, hill-climbing search
  - Stopping criterion: when leaves are pure
  - Pruning mechanism: cross-validation to select a Gini threshold
} C4.5
  - Evaluation criterion: information gain
  - Search algorithm: simple-to-complex, hill-climbing search
  - Stopping criterion: when leaves are pure
  - Pruning mechanism: reduced-error pruning

Example: reduced-error pruning
} Use a pruning set to estimate the accuracy of sub-trees and of individual nodes
} Let $T_v$ be the sub-tree rooted at node v; define the gain of pruning at v as the reduction in error on the pruning set when $T_v$ is replaced by a leaf
} Repeat: prune at the node with the largest gain, until only negative-gain nodes remain
} Bottom-up restriction: $T_v$ can only be pruned if it does not contain a sub-tree with lower error than $T_v$ itself
Source: www.ailab.si/blaz/predavanja/uisp/slides/uisp05-postpruning.ppt
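Below is a minimal sketch of reduced-error pruning for the dict-based trees produced by the earlier build_tree sketch (not the slide's original code). For simplicity it makes a single bottom-up pass and prunes a node whenever the pruned version is no worse on the pruning set, rather than repeatedly pruning the largest-gain node as on the slide; it also uses the pruning-set majority as the leaf label, which a full implementation would take from the training data.

```python
from collections import Counter

def predict(node, row, default):
    # Walk the dict-based tree; fall back to `default` for unseen attribute values.
    while isinstance(node, dict):
        node = node["branches"].get(row[node["attr"]], default)
    return node

def errors(node, rows, labels, default):
    return sum(predict(node, r, default) != y for r, y in zip(rows, labels))

def reduced_error_prune(node, rows, labels):
    # `rows`/`labels` are the pruning-set examples that reach this node.
    if not isinstance(node, dict) or not labels:
        return node
    majority = Counter(labels).most_common(1)[0][0]
    for value, child in node["branches"].items():
        idx = [i for i, r in enumerate(rows) if r[node["attr"]] == value]
        node["branches"][value] = reduced_error_prune(
            child, [rows[i] for i in idx], [labels[i] for i in idx])
    # Replace this sub-tree with a majority-class leaf if that is no worse.
    if errors(majority, rows, labels, majority) <= errors(node, rows, labels, majority):
        return majority
    return node
```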

Pre-pruning methods
} Stop growing the tree at some point during top-down construction, when there is no longer sufficient data to make reliable decisions
} Approach:
  - Choose a threshold on the feature score
  - Stop splitting if the best feature score is below the threshold

Determine the chi-square threshold analytically
} Stop growing when the chi-square feature score is not statistically significant
} Chi-square has a known sampling distribution, so the significance threshold can be looked up:
$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with degrees of freedom = (#rows - 1)(#cols - 1)
} For a 2x2 table, 3.84 is the 95% critical value
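The critical values can also be looked up programmatically, e.g. with SciPy (a sketch, assuming SciPy is installed):

```python
from scipy.stats import chi2

# 95% critical value for 1 degree of freedom, i.e. a 2x2 contingency table
print(round(chi2.ppf(0.95, df=1), 2))                  # 3.84

# For an r x c table the degrees of freedom are (r - 1)(c - 1),
# e.g. the 3x2 Income table has 2 degrees of freedom
print(round(chi2.ppf(0.95, df=(3 - 1) * (2 - 1)), 2))  # 5.99
```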

K-fold cross validation
} Randomly partition the training data into k folds
} For i = 1 to k: learn the model on D minus the i-th fold; evaluate the model on the i-th fold
} Average the results from all k trials
[Figure: the dataset (columns Y, X1, X2) split into six train/test pairs, Train1/Test1 through Train6/Test6.]
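A minimal sketch of the k-fold loop (illustrative; `fit` and `evaluate` stand in for whatever learner and metric are being cross-validated):

```python
import random

def k_fold_cv(examples, k, fit, evaluate, seed=0):
    # Randomly partition the example indices into k folds.
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    scores = []
    for i in range(k):
        test = [examples[j] for j in folds[i]]
        train = [examples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit(train)                    # learn on D minus the i-th fold
        scores.append(evaluate(model, test))  # evaluate on the i-th fold
    return sum(scores) / k                    # average over the k trials
```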

Choosing a Gini threshold with cross validation
} For i in 1..k:
  - For t in a threshold set (e.g., [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]):
    - Learn a decision tree on Train_i with Gini gain threshold t (i.e., stop growing when the max Gini gain is less than t)
    - Evaluate the learned tree on Test_i (e.g., with accuracy)
  - Set t_max,i to the t with the best performance on Test_i
} Set t_max to the average of t_max,i over the k trials
} Relearn the tree on all the data using t_max as the Gini gain threshold
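A sketch of this procedure with scikit-learn, assuming X and y are NumPy arrays; DecisionTreeClassifier's min_impurity_decrease parameter is used here as a stand-in for the slide's Gini gain threshold (it is a sample-weighted impurity decrease, so its practical scale is usually smaller than the threshold set shown above):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def choose_gini_threshold(X, y, thresholds=(0.0, 0.01, 0.02, 0.05, 0.1, 0.2), k=5):
    best_per_fold = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        fold_scores = []
        for t in thresholds:
            # Stop splitting when the (weighted) impurity decrease falls below t.
            tree = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=t)
            tree.fit(X[train_idx], y[train_idx])
            fold_scores.append(tree.score(X[test_idx], y[test_idx]))   # accuracy on Test_i
        best_per_fold.append(thresholds[int(np.argmax(fold_scores))])  # t_max,i
    t_max = float(np.mean(best_per_fold))                              # average over the k trials
    # Relearn on all the data using the averaged threshold.
    return DecisionTreeClassifier(criterion="gini", min_impurity_decrease=t_max).fit(X, y)
```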

Modification of the score function to better represent model value
[Figure: the same score-vs-threshold plots as before (small sample, biased sample, large sample, full data).]