Machine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler

Decision trees. Functional form f(x; µ): nested if-then-else statements. Discrete features: fully expressive (can represent any function). Structure: internal nodes check a feature and branch on its value; leaf nodes output a prediction. XOR example (features x1, x2; y = + when exactly one of them is true): the root tests X1, each branch then tests X2, and the leaves return the value.

    def f_xor(X1, X2):          # the tree as nested if-then-else
        if X1:                  # branch on feature at root
            if X2: return '-'   # if true, branch on the right child's feature
            else:  return '+'   # ... and return the leaf value
        else:                   # left branch:
            if X2: return '+'   # branch on the left child's feature
            else:  return '-'   # ... and return the leaf value

Parameters? The tree structure, the features tested, and the leaf outputs.

Decision trees. Real-valued features: compare a feature value to some threshold, e.g. a root test "X1 > 0.5?" whose children test "X2 > 0.5?" and a second threshold on X1. (Figure: the resulting axis-aligned partition of the two-dimensional feature space.)

Decision trees. Categorical variables: could have one child per value (X1 = ? with branches A, B, C, D), or use binary splits, either on single values ({A} vs {B,C,D}) or on subsets ({A,D} vs {B,C}). With one child per value, the discrete variable will not appear again below that node; with binary splits it could appear again multiple times. (Single-value binary splits are easy to implement using a 1-of-k representation.)
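A minimal sketch of the 1-of-k idea in plain Python/NumPy (illustrative only, not the course's mltools code): once a categorical feature is one-hot encoded, thresholding a single indicator column reproduces a {value} vs. {everything else} split.

    import numpy as np

    def one_of_k(x, values=None):
        """Encode a sequence of categories as an (n, k) binary indicator matrix."""
        values = sorted(set(x)) if values is None else values
        return np.array([[1.0 if xi == v else 0.0 for v in values] for xi in x]), values

    x = ['A', 'B', 'C', 'D', 'A', 'C']
    Xk, vals = one_of_k(x)
    left = Xk[:, 0] > 0.5        # thresholding column 0 splits {A} vs {B, C, D}
    print(vals)                  # ['A', 'B', 'C', 'D']
    print(left)                  # [ True False False False  True False]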

Decision trees. Complexity of the function depends on the depth. A depth-1 decision tree is called a decision stump, and is even simpler than a linear classifier! (Figure: a single split "X1 > 0.5?" divides the feature space into two regions.)

Decision trees. Complexity of the function depends on the depth: more splits provide a finer-grained partitioning. (Figure: root "X1 > 0.5?" with children "X2 > 0.6?" and "X1 > 0.85?", giving four regions.) Depth d gives up to 2^d regions & predictions.

Decision trees for regression. Exactly the same idea, but we predict real-valued numbers at the leaf nodes. Examples on a single scalar feature: depth 1 gives 2 regions & predictions; depth 2 gives 4 regions & predictions.

Machine Learning and Data Mining. Learning Decision Trees. Prof. Alexander Ihler

Learning decision trees. Break the problem into two parts: should this be a leaf node, and if so, what should we predict? If not, how should we further split the data? Leaf nodes: make the best prediction given this data subset (classify: pick the majority class; regress: predict the average value). Non-leaf nodes: pick a feature and a split, greedily: score all possible features and splits, where the score function measures the purity of the data after the split, i.e., how much easier our prediction task becomes after we divide the data. When to make a leaf node? When all training examples have the same class (and so are predicted correctly), or are indistinguishable; at a fixed depth (a fixed-complexity decision boundary); or by other criteria. Example algorithms: ID3, C4.5; see e.g. Wikipedia, "Classification and regression tree".
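A small sketch of this greedy, recursive procedure in plain Python/NumPy (illustrative only, not the course's mltools implementation; the stopping parameters max_depth and min_data are assumptions):

    import numpy as np

    def entropy(y):
        """Entropy (bits) of a vector of class labels."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(X, y):
        """Score every (feature, threshold) pair; return the one with the largest information gain."""
        base, best = entropy(y), (0.0, None, None)
        n = len(y)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:         # candidate thresholds
                left = X[:, j] <= t
                h_after = (left.sum() / n) * entropy(y[left]) + ((~left).sum() / n) * entropy(y[~left])
                if base - h_after > best[0]:
                    best = (base - h_after, j, t)
        return best                                   # (gain, feature, threshold)

    def build_tree(X, y, depth=0, max_depth=3, min_data=2):
        """Recursively grow a tree; leaves predict the majority class of their data subset."""
        values, counts = np.unique(y, return_counts=True)
        majority = values[np.argmax(counts)]
        if depth >= max_depth or len(y) < min_data or len(values) == 1:
            return ('leaf', majority)                 # stop: make a leaf node
        gain, j, t = best_split(X, y)
        if j is None:                                 # no split improves purity
            return ('leaf', majority)
        left = X[:, j] <= t
        return ('split', j, t,
                build_tree(X[left],  y[left],  depth + 1, max_depth, min_data),
                build_tree(X[~left], y[~left], depth + 1, max_depth, min_data))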

Learning decision trees

Scoring decision tree splits. Suppose we are considering splitting on feature 1: how can we score any particular split? Impurity: how easy is the prediction problem in the leaves? A greedy approach could choose the split with the best accuracy, assuming we have to predict a value next: MSE for regression, 0/1 loss for classification. But a soft score can work better. (Figure: data along feature X1 and a candidate threshold split "X1 > t?"; which t should we pick?)

Entropy and information. Entropy is a measure of randomness: how hard is it to communicate a result to you? It depends on the probability of the outcomes. Communicating fair coin tosses (output: H H T H T T T H H H H T), the sequence takes n bits, since each outcome is totally unpredictable. Communicating my daily lottery results, the message most likely takes one bit: "I lost every day." There is a small chance I'll have to send more bits (won, and when). This takes less work to communicate because it is less random: use a few bits for the most likely outcome, and more bits for the less likely ones (Lost: a single bit; Won once: a longer code plus which day; Won twice: a longer code plus the two days).

Entropy and information. Entropy H(x) = E[ log 1/p(x) ] = Σ_x p(x) log 1/p(x). Logs are base two, so the units of entropy are bits. For two outcomes: H = -p log(p) - (1-p) log(1-p). Examples (distributions over 4 outcomes, shown as bar plots on the slide):
* H(x) = 0.25 log 4 + 0.25 log 4 + 0.25 log 4 + 0.25 log 4 = log 4 = 2 bits (max entropy for 4 outcomes)
* H(x) = 0.75 log 4/3 + 0.25 log 4, approximately 0.81 bits
* H(x) = 1 log 1 = 0 bits (min entropy)
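A quick numerical check of these values (a plain-Python sketch, not course code):

    import numpy as np

    def entropy_bits(p):
        """Entropy in bits of a probability vector p (zero-probability outcomes contribute 0)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(np.sum(p * np.log2(1.0 / p)))

    print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: max entropy for 4 outcomes
    print(entropy_bits([0.75, 0.25]))               # ~0.81 bits
    print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits: min entropy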

Entropy and information. Information gain: how much is the entropy reduced by the measurement (here, by the split)? Information = expected information gain. (Figure: one candidate split. Parent node: H = 0.99 bits. One child receives 3/8 of the data and has H = 0.77 bits; the other receives 5/8 of the data and has H = 0.) Information = 3/8 * (0.99 - 0.77) + 5/8 * (0.99 - 0). Equivalently, this is the mutual information between the split variable s and the class c, Σ_{s,c} p(s,c) log [ p(s,c) / (p(s) p(c)) ], evaluated at the empirical counts of the example.

Entropy and information. Information gain, continued: a different candidate split of the same data. (Figure: parent H = 0.99 bits; one child receives 7/8 of the data and has H = 0.97 bits; the other receives 1/8 of the data and has H = 0.) Information = 7/8 * (0.99 - 0.97) + 1/8 * (0.99 - 0). Less reduction in entropy means a less desirable split of the data.
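The weighted computation from the last two slides, in code (an illustrative sketch using the entropies and data fractions stated above):

    # Expected information gain of a split, from each child's entropy and its share of the data.
    def information_gain(h_parent, children):
        """children: list of (fraction_of_data, child_entropy) pairs."""
        return sum(frac * (h_parent - h_child) for frac, h_child in children)

    print(information_gain(0.99, [(3/8, 0.77), (5/8, 0.0)]))   # first split: ~0.70 bits
    print(information_gain(0.99, [(7/8, 0.97), (1/8, 0.0)]))   # second split: ~0.14 bits (worse)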

Gini index & impurity. An alternative to information gain that measures the variance in the class allocation (instead of the entropy): H_gini = Σ_c p(c) (1 - p(c)), vs. H_ent = -Σ_c p(c) log p(c). (Figure: the same split as before, scored with Gini impurity. Parent: H_gini = 0.494. One child receives 3/8 of the data, with H_gini = 0.355; the other receives 5/8 of the data, with H_gini = 0.) Gini index = 3/8 * (0.494 - 0.355) + 5/8 * (0.494 - 0).
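The same split scored with Gini impurity (illustrative Python; the class proportions below are assumptions chosen to roughly reproduce the slide's numbers):

    import numpy as np

    def gini(p):
        """Gini impurity of a probability vector p: sum_c p(c) * (1 - p(c))."""
        p = np.asarray(p, dtype=float)
        return float(np.sum(p * (1.0 - p)))

    h_parent = gini([0.555, 0.445])    # ~0.49
    h_left   = gini([0.77, 0.23])      # ~0.35
    h_right  = gini([1.0, 0.0])        # 0.0 (pure leaf)
    score = 3/8 * (h_parent - h_left) + 5/8 * (h_parent - h_right)
    print(h_parent, h_left, score)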

Entropy vs Gini impurity. The two are nearly the same: pick whichever one you like. (Figure: both impurity measures plotted as a function of P(y=1).)

For regression. The most common choice is to measure variance reduction, which is equivalent to information gain in a Gaussian model. (Figure example: parent variance Var = 0.25; one child receives 4/10 of the data, with Var = 0.1; the other receives 6/10 of the data, with Var = 0.2.) Var reduction = 4/10 * (0.25 - 0.1) + 6/10 * (0.25 - 0.2).
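The same idea computed directly from a vector of responses (an illustrative sketch; the data here are made up):

    import numpy as np

    def variance_reduction(y, left_mask):
        """Variance reduction when the responses y are split by a boolean mask."""
        y, n = np.asarray(y, dtype=float), len(y)
        yl, yr = y[left_mask], y[~left_mask]
        return y.var() - (len(yl) / n) * yl.var() - (len(yr) / n) * yr.var()

    y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
    print(variance_reduction(y, y < 3.0))    # separates low from high values: large reduction
    print(variance_reduction(y, np.array([True, False, True, False, True, False])))  # mixed split: small reduction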

Scoring decision tree splits

Building a decision tree. Stopping conditions:
* # of data < K
* Depth > D
* All data indistinguishable (discrete features)
* Prediction sufficiently accurate
* An information-gain threshold? Often not a good idea! Sometimes no single split improves the score, but two splits in a row do. Better: build the full tree, then prune.
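Inside a recursive builder, these checks might look like the following (a sketch; the thresholds K = min_data and D = max_depth and the function name are illustrative, not from the course code):

    import numpy as np

    def should_stop(X, y, depth, min_data=4, max_depth=8):
        """Return True if this node should become a leaf."""
        if len(y) < min_data:          # too few data points (# of data < K)
            return True
        if depth >= max_depth:         # tree already deep enough (Depth > D)
            return True
        if len(np.unique(y)) == 1:     # all examples the same class: prediction already correct
            return True
        if np.all(X == X[0]):          # remaining examples are indistinguishable
            return True
        return False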

Controlling complexity: maximum depth cutoff. (Figure: decision boundaries learned with no depth limit and with the depth limited to 1, 2, 3, 4, and 5.)

Controlling complexity: minimum # of data required at a parent node to allow a split. (Figure: decision boundaries for increasing minParent values, including 3 and 5.) An alternate (similar) control: minimum # of data per leaf.

Decision trees in Python. Many implementations exist. The class implementation handles real-valued features (discrete features can use a 1-of-k encoding) and uses entropy as its score (easy to extend):

    T = dt.treeClassify()
    T.train(X, Y, maxDepth=2)
    print T

which prints the learned tree, for example:

    if x[0] < 5.62476:
      if x[1] < 3.9747:
        Predict 0.     # green
      else:
        Predict 1.     # blue
    else:
      if x[0] < 6.86588:
        Predict 0.     # green
      else:
        Predict 2.     # red

    ml.plotClassify2D(T, X, Y)
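For comparison, a roughly equivalent model in scikit-learn (a different library from the course's mltools; its max_depth, min_samples_split, and min_samples_leaf parameters play the roles of maxDepth / minParent / minLeaf above, and the Iris data here is just a stand-in dataset):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, Y = load_iris(return_X_y=True)
    T = DecisionTreeClassifier(criterion='entropy',    # entropy-based (information gain) splits
                               max_depth=2,            # depth cutoff
                               min_samples_split=2,    # "minParent"-style control
                               min_samples_leaf=1)     # "minLeaf"-style control
    T.fit(X, Y)
    print(export_text(T))                              # print the learned tree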

Summary. Decision trees: a flexible functional form; at each level, pick a variable and a split condition; at the leaves, predict a value. Learning decision trees: score all splits & pick the best (classification: information gain or the Gini index; regression: expected variance reduction); use stopping criteria to decide when to make a leaf; complexity depends on the depth. Decision stumps: very simple classifiers.