
AI: Decision Tree Learning, Part I
Illah Nourbakhsh's version. Chapter 18, Russell and Norvig. Thanks to all past instructors. Carnegie Mellon, 11/9/2010.

Outline
Learning and philosophy. Induction versus deduction. Decision trees: introduction.

Learning
How do we define learning? Gathering more knowledge: knowing more after learning than was known before. Learning substitutes for the need to model a priori. Experience, feedback, refinement. Learning modifies the agent's decision mechanisms to improve performance.

Learning
Learning agents. Declarative versus procedural knowledge. Explicit versus implicit knowledge. Induction versus deduction.

Learning Element
A bit of magic: which components of the performance element are to be learned, what feedback is available to learn these components, and what representation is used for the components.

Type of feedback:
Supervised learning: correct labels/answers are given.
Unsupervised learning: correct labels/answers are missing.
Reinforcement learning: occasional rewards.

Inductive Learning
Simplest form: learn a function from examples. f is the target function; an example is a pair (x, f(x)). This is supervised learning. Problem: find a hypothesis h such that h ≈ f, given a training set of examples.

Inductive Learning Method
Construct/adjust h to agree with f on the training set. h is consistent if it agrees with f on all examples. E.g., curve fitting (good idea? bad idea?): the slides' figures fit curves of increasing complexity to the same training points.
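To make the curve-fitting analogy concrete, here is a small sketch of my own (not from the slides; it assumes NumPy is available) that fits polynomials of two different degrees to a few noisy points. The high-degree hypothesis can be made consistent with essentially every training example, yet it typically generalizes worse, which is the overfitting danger raised on the next slide.

# Illustrative sketch (not from the lecture): curve fitting as inductive learning.
# A low-degree hypothesis is simple; a high-degree hypothesis can agree with f on
# (almost) every training example but tends to wiggle between the points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
f_true = lambda x: 2 * x + 1                        # the unknown target function f
y_train = f_true(x_train) + rng.normal(0, 0.1, 10)  # noisy observations of f(x)

x_test = np.linspace(0, 1, 101)
for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, degree)   # construct/adjust the hypothesis h
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - f_true(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")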

Inductive Learning Method
Construct/adjust h to agree with f on the training set; beware the dangers of overfitting.

Inductive Learning Method: Bias
The hypothesis form biases the learning outcome. Occam's razor: prefer the simplest hypothesis consistent with the data.

Attribute-based Data Sets
Examples are described by attribute values (Boolean, discrete, continuous). The function to learn is the class, e.g., wait/not wait for a table at a restaurant. The classification of an example is positive (T) or negative (F).

Attribute-based Data Set
Type: drama, comedy, thriller. Company: MGM, Columbia. Director: Bergman, Spielberg, Hitchcock. Mood: stressed, relaxed, normal.

Movie  Type      Company   Director   Mood      Likes-movie?
m1     thriller  MGM       Bergman    normal    No
m2     comedy    Columbia  Spielberg  stressed  Yes
m3     comedy    MGM       Spielberg  relaxed   No
m4     thriller  MGM       Bergman    relaxed   No
m5     comedy    MGM       Hitchcock  normal    Yes
m6     drama     Columbia  Bergman    relaxed   Yes
m7     drama     Columbia  Bergman    normal    No
m8     drama     MGM       Spielberg  stressed  No
m9     drama     MGM       Hitchcock  normal    Yes
m10    comedy    Columbia  Spielberg  relaxed   No
m11    thriller  MGM       Spielberg  normal    No
m12    thriller  Columbia  Hitchcock  relaxed   No

Input Data
Input attributes X1 = x1, ..., XM = xM are fed to a classifier, which outputs a class prediction Y = y; the classifier is learned from training data.

Data Representation: Decision Trees
One possible representation for hypotheses. [Figure: the "true" tree for deciding whether to wait for a table at the restaurant.]

Expressiveness
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row corresponds to a path to a leaf. Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. The goal of learning is to find more compact decision trees.

Building a Decision Tree
Three variables: Hair = {blond (B), dark (D)}, Height = {tall (T), short (S)}, and the output, Country = {Gromland (G), Polvia (P)}.
Training data: (B,T,P), (B,T,P), (B,S,G), (B,S,G), (D,S,G), (D,T,G).
The root holds all six examples (P:2, G:4). Splitting on Hair: the Hair = B branch holds P:2, G:2 and the Hair = D branch holds P:0, G:2. Splitting the Hair = B branch on Height: Height = T holds P:2, G:0 and Height = S holds P:0, G:2.
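As a brief aside of my own (not from the slides), the following sketch builds the trivial consistent tree for a small Boolean function, with one root-to-leaf path per truth table row; such a tree fits its data exactly but merely memorizes it, which is why we look for more compact trees.

# Hypothetical illustration: the trivial decision tree for a Boolean function,
# one path per truth-table row (here f = XOR of two inputs).
def xor(a, b):
    return a != b          # the target Boolean function f

def trivial_tree(variables, assignment=()):
    """One internal node per remaining variable; each truth-table row becomes a path."""
    if len(assignment) == len(variables):
        return xor(*assignment)                      # leaf: value of f on this row
    var = variables[len(assignment)]
    return (var, {value: trivial_tree(variables, assignment + (value,))
                  for value in (False, True)})

print(trivial_tree(("A", "B")))
# ('A', {False: ('B', {False: False, True: True}), True: ('B', {False: True, True: False})})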

At each level of the tree, we split the data according to the value of one of the attributes. After enough splits, only one output class is represented in a node: it is a pure node, and it becomes a terminal leaf of the tree. For example, the Hair = D node holds P:0, G:2, so G is the output for this node.

The class of a new input is obtained by following the tree all the way down to a leaf and reporting the output of the leaf. For example, (B,T) is classified as P and (D,S) is classified as G.

General Decision Tree
Each internal node tests one attribute Xj, with one branch for each of its possible values (first value, ..., ith value, ..., nth value); each leaf outputs a class Y = y.

Basic Questions
How to choose the attribute to split on at each level of the tree? When to stop splitting; when should a node be declared a leaf? If a leaf node is impure, how should the class label be assigned? If the tree is too large, how can it be pruned?
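A minimal sketch of my own (the names and representation are hypothetical, not from the slides) of how the learned Hair/Height tree can be stored and how a new input is classified by walking from the root to a leaf:

# Hypothetical representation of the tree from the Hair/Height example.
# Internal nodes test one attribute; leaves store the output class (P or G).
def make_leaf(label):
    return {"leaf": label}

def make_node(attribute, children):
    return {"attribute": attribute, "children": children}   # children: value -> subtree

tree = make_node("Hair", {
    "B": make_node("Height", {"T": make_leaf("P"), "S": make_leaf("G")}),
    "D": make_leaf("G"),
})

def classify(tree, example):
    """Follow the tree down to a leaf and report the leaf's output."""
    node = tree
    while "leaf" not in node:
        node = node["children"][example[node["attribute"]]]
    return node["leaf"]

print(classify(tree, {"Hair": "B", "Height": "T"}))  # -> P
print(classify(tree, {"Hair": "D", "Height": "S"}))  # -> G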

Set of Training Examples
The movie data set shown above: attributes Type (drama, comedy, thriller), Company (MGM, Columbia), Director (Bergman, Spielberg, Hitchcock), and Mood (stressed, relaxed, normal); class Likes-movie?.

Choosing the Best Attribute
The ideal attribute partitions the examples into all positive or all negative (or, more generally, into partitions that each come from a single class): the attribute that results in the highest discrimination. How good is the attribute Type of movie?

Entropy (Shannon and Weaver 1949)
Entropy captures the degree of purity of a data distribution. In general, the entropy H is the average number of bits necessary to encode n values:

H = - Σ_{i=1..n} P_i log2 P_i

where P_i is the probability of occurrence of value i. High entropy: all the classes are (nearly) equally likely, so an observation carries little information. Low entropy: few classes are likely and most classes are rarely observed, so an observation carries a high level of information. [Figures: class-frequency histograms, nearly uniform for high entropy and sharply peaked for low entropy.]

Entropy for Attribute Choice
Entropy measures the information provided by an attribute. Define the entropy of a set of examples S as the information content of S:

Entropy(S) = Σ_{i=1..c} - p_i log2 p_i,  where p_i = |S_i| / |S|,

with c classes and S_i the subset of S belonging to class i.

Entropy Unit
A bit of information is the information content of the actual answer when there are two equally probable answers:

E(S) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.

[Figure: entropy of a Boolean classification as a function of the proportion of positive examples, maximal at 1 bit when the classes are equally likely.]

Entropy Example
|S| = 12, c = 2 (+, -), |S+| = 4, |S-| = 8:

E(S) = -(4/12) log2(4/12) - (8/12) log2(8/12) ≈ 0.918.
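A small sketch of my own (the function name is hypothetical) of the entropy computation on these slides; it reproduces the 1-bit Boolean case and the ≈ 0.918 value for 4 positive and 8 negative examples:

import math

def entropy(counts):
    """Entropy, in bits, of a class distribution given as a list of class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

print(entropy([1, 1]))   # two equally probable answers -> 1.0 bit
print(entropy([4, 8]))   # 4 positive, 8 negative examples -> ~0.918 bits
print(entropy([0, 4]))   # pure node -> 0.0 bits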

Choosing the Best Attribute
Choose the attribute A with the highest information gain: the expected reduction in the entropy of the set S when it is partitioned according to A,

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v).

For the movie data set (4 positive, 8 negative examples), splitting on Type gives

Gain(S, Type) = E(4,8) - [4/12 · E(0,4) + 4/12 · E(2,2) + 4/12 · E(2,2)] = 0.252.

Other Attributes
The same computation for the other attributes gives smaller gains (the slide lists 0.07, 0.055, and, for Director, E(4,8) - [4/12 · E(1,3) + 5/12 · E(1,4) + 3/12 · E(2,1)] ≈ 0.118), so Type has the highest information gain: commitment to one hypothesis.

Split Data and Continue
What is the next best attribute? The procedure is recursive. Within the Type = drama branch (2 positive, 2 negative examples):

Gain(S, Mood) = E(2,2) - [1/4 · E(1,0) + 1/4 · E(0,1) + 2/4 · E(1,1)] = 0.5
Gain(S, Company) = E(2,2) - [2/4 · E(1,1) + 2/4 · E(1,1)] = 0
Gain(S, Director) = E(2,2) - [2/4 · E(1,1) + 1/4 · E(0,1) + 1/4 · E(1,0)] = 0.5
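The sketch below (my own, not from the slides) recomputes the information gain of each attribute on the movie data set as reconstructed above; Type comes out on top with a gain of about 0.25, so it is chosen as the root.

import math
from collections import Counter, defaultdict

# Movie data set: (Type, Company, Director, Mood, Likes-movie?)
DATA = [
    ("thriller", "MGM",      "Bergman",   "normal",   "No"),
    ("comedy",   "Columbia", "Spielberg", "stressed", "Yes"),
    ("comedy",   "MGM",      "Spielberg", "relaxed",  "No"),
    ("thriller", "MGM",      "Bergman",   "relaxed",  "No"),
    ("comedy",   "MGM",      "Hitchcock", "normal",   "Yes"),
    ("drama",    "Columbia", "Bergman",   "relaxed",  "Yes"),
    ("drama",    "Columbia", "Bergman",   "normal",   "No"),
    ("drama",    "MGM",      "Spielberg", "stressed", "No"),
    ("drama",    "MGM",      "Hitchcock", "normal",   "Yes"),
    ("comedy",   "Columbia", "Spielberg", "relaxed",  "No"),
    ("thriller", "MGM",      "Spielberg", "normal",   "No"),
    ("thriller", "Columbia", "Hitchcock", "relaxed",  "No"),
]
ATTRIBUTES = ["Type", "Company", "Director", "Mood"]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class column minus the weighted entropy after splitting on one attribute."""
    parent = entropy([r[-1] for r in rows])
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attr_index]].append(r[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return parent - remainder

for i, name in enumerate(ATTRIBUTES):
    print(f"Gain(S, {name}) = {information_gain(DATA, i):.3f}")
# Type has the highest gain (~0.25) and is chosen as the root attribute.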

Learned Tree
[Figure: the decision tree learned for the movie data set, with Type at the root.]

Continuous Case: How to Split?
Two classes (red circles / green crosses), two attributes (the X and Y coordinates), and a set of points as training data. Idea: construct a decision tree such that the leaf nodes predict correctly the class of all the training examples.

Good and Bad Splits
Good: one resulting node is pure, because only one class is left in it (no ambiguity in the class label), and the other is almost pure (little ambiguity in the class label). Bad: the resulting nodes contain a mixture of classes and do not disambiguate between the classes.

We want to find the most compact, smallest-size tree (Occam's razor) that classifies the training data correctly, so we want the split choices that get us to pure nodes the fastest. A candidate split is scored by its information gain, IG = H - (nL/n · HL + nR/n · HR), where H is the entropy of the node being split and HL, HR are the entropies of the left and right children, weighted by the fraction of the node's points that fall in each child.

In the example figures, the node being split has H = 0.99. The good split yields children with HL = 0 (pure: only one class left, no ambiguity in the class label) and HR = 0.58 (almost pure: little ambiguity), for IG = 0.62. The bad split yields children with HL = 0.97 and HR = 0.92, each still a mixture of classes, for IG = 0.052. Choose the first split, because its information gain is greater than with the other split.

More Complete Example
20 training examples from class A and 20 training examples from class B; the attributes are the X and Y coordinates of the points. For each attribute, plot the information gain as a function of the split value. The best split value (maximum information gain) for the X attribute is 0.2 with IG = 0.138; the best split value for the Y attribute is 0.23 with IG = 0.202. Split on Y at 0.23.
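A sketch of my own (with made-up toy data, so the thresholds and gains differ from the slides' example) of how the best split value for one continuous attribute can be found by scanning candidate thresholds and keeping the one with maximum information gain:

import math

def entropy(labels):
    total = len(labels)
    h = 0.0
    for lab in set(labels):
        p = labels.count(lab) / total
        h -= p * math.log2(p)
    return h

def best_threshold(values, labels):
    """Scan midpoints between sorted attribute values; return (threshold, information gain)."""
    parent_h = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical attribute values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent_h - remainder
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data: class A points tend to have small x, class B points large x.
x = [0.05, 0.1, 0.15, 0.3, 0.6, 0.7, 0.8, 0.9]
y = ["A", "A", "A", "A", "B", "B", "B", "A"]
print(best_threshold(x, y))  # threshold ~0.45 with gain ~0.55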

Best X split: 0.2 with IG = 0.138; best Y split: 0.23 with IG = 0.202. Split on Y at 0.23. One of the resulting nodes contains data from only a single class, so there is no point in splitting it further: return it as a leaf with that class as output. The other node is not pure, so we need to split further. Within it, the best split value (maximum information gain) for the X attribute is 0.22 with IG ≈ 0.82, and the best split value for the Y attribute is 0.75 with IG ≈ 0.353.

Best X split: 0.22 with IG = 0.82; best Y split: 0.75 with IG = 0.353, so split on X. Again, one of the resulting nodes contains data from only a single class, so it is returned as a leaf node with that class as output.

Final Decision Tree
Each of the leaf nodes is pure: it contains data from only one class. Given an input (X, Y), follow the tree down to a leaf and return the corresponding output class for this leaf, e.g., for (X, Y) = (0.5, 0.5).

Induction of Decision Trees
ID3(examples, attributes):
  if there are no examples, then return default
  else if all examples are members of the same class, then return the class
  else if there are no attributes, then return Most_Common_Class(examples)
  else
    best_attribute <- Choose_Best(examples, attributes)
    root <- Create_Node_Test(best_attribute)
    for each value v_i of best_attribute:
      examples_vi <- subset of examples with best_attribute = v_i
      subtree <- ID3(examples_vi, attributes - {best_attribute})
      set subtree as a child of root with label v_i
    return root

Basic Questions
How to choose the attribute/value to split on at each level of the tree? When to stop splitting; when should a node be declared a leaf? If a leaf node is impure, how should the class label be assigned? If the tree is too large, how can it be pruned?

Comments on Tree Termination
Zero entropy in the data set (perfect classification):
Type = thriller: 0+ 4-
Type = drama, Mood = stressed: 0+ 1-
Type = drama, Mood = relaxed: 1+ 0-
Type = drama, Mood = normal, Director = Bergman: 0+ 1-
Type = drama, Mood = normal, Director = Hitchcock: 1+ 0-
No examples available:
Type = drama, Mood = normal, Director = Spielberg
Indistinguishable data:
Type = comedy, attribute Company: m2 (comedy, Columbia, Yes) and m10 (comedy, Columbia, No); m3 (comedy, MGM, No) and m5 (comedy, MGM, Yes)
Type = comedy, attribute Director: m2 (Spielberg, Yes) and m3 (Spielberg, No); m5 (Hitchcock, Yes) and m10 (Spielberg, No)
(Type = comedy, attribute Mood: could it have split?)

Errors
Split the labeled data D into training and validation sets. Training set error: the fraction of training examples for which the decision tree disagrees with the true classification. Validation set error: the fraction of validation examples, held out from the given labeled examples, for which the decision tree disagrees with the true classification.
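The ID3 pseudocode above could be turned into working code roughly as follows; this is a hedged sketch (the function and variable names are my own), using information gain to pick the best attribute and handling the termination cases from the pseudocode.

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    parent = entropy([e[target] for e in examples])
    parts = defaultdict(list)
    for e in examples:
        parts[e[attr]].append(e[target])
    return parent - sum(len(p) / len(examples) * entropy(p) for p in parts.values())

def id3(examples, attributes, target, default=None):
    """Recursive ID3: returns a class label (leaf) or an (attribute, {value: subtree}) node."""
    if not examples:
        return default
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:
        return labels[0]                      # all examples are in the same class
    most_common = Counter(labels).most_common(1)[0][0]
    if not attributes:
        return most_common                    # no attributes left to split on
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    children = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        children[value] = id3(subset, [a for a in attributes if a != best],
                              target, default=most_common)
    return (best, children)

# Tiny usage example on the Hair/Height/Country data from the slides:
data = [
    {"Hair": "B", "Height": "T", "Country": "P"},
    {"Hair": "B", "Height": "T", "Country": "P"},
    {"Hair": "B", "Height": "S", "Country": "G"},
    {"Hair": "B", "Height": "S", "Country": "G"},
    {"Hair": "D", "Height": "S", "Country": "G"},
    {"Hair": "D", "Height": "T", "Country": "G"},
]
print(id3(data, ["Hair", "Height"], "Country"))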

Overfitting
The tree is too specialized: it perfectly fits the training data. A tree T overfits the training data if there is an alternative tree T' such that, on the training set, the error with T is smaller than the error with T', but over the complete distribution D, the error with T is larger than the error with T'. Remedies: node pruning, rule post-pruning, cross-validation.

Reduced-Error Pruning
Remove a subtree, make the node a leaf node, and assign it the most common classification of the examples at that node. Check the validation set error. Continue pruning as long as the error does not increase with respect to the error of the unpruned tree. Rule post-pruning: represent the tree as a set of rules, remove rules independently, check the validation set error, and stop with the same criterion.

Other Issues
Attributes with many values: information gain may (unfairly) select them; consider the splitting value and use the ratio between the gain and the splitting value (the gain ratio). Attributes with cost: use other metrics. Attributes with missing values: infer the most common value from the examples at the node.

Summary
Inductive learning, supervised learning, classification. Decision trees represent hypotheses; decision tree learning is driven by the information gain heuristic; a recursive algorithm builds the tree. Next class: more on continuous values; missing and noisy attributes; overfitting and pruning; different attribute selection criteria.
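As a closing illustration, here is a hedged sketch of my own (not the lecture's code) of reduced-error pruning for trees stored as either a class label (leaf) or an (attribute, {value: subtree}) pair, as in the ID3 sketch above: each subtree is tentatively replaced by a leaf labeled with the majority class of the training examples reaching it, and the replacement is kept whenever the error on the validation examples reaching that subtree does not increase.

from collections import Counter

def classify(tree, example):
    """A tree is either a class label (leaf) or a tuple (attribute, {value: subtree})."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(example[attr])
    return tree

def error_rate(tree, examples, target):
    if not examples:
        return 0.0
    wrong = sum(1 for e in examples if classify(tree, e) != e[target])
    return wrong / len(examples)

def reduced_error_prune(tree, train, validation, target):
    """Bottom-up: replace a subtree by a majority-class leaf if validation error does not increase."""
    if not isinstance(tree, tuple) or not train:
        return tree
    attr, children = tree
    pruned_children = {
        v: reduced_error_prune(sub,
                               [e for e in train if e[attr] == v],
                               [e for e in validation if e[attr] == v],
                               target)
        for v, sub in children.items()
    }
    pruned_tree = (attr, pruned_children)
    # Candidate: collapse this subtree to a leaf labeled with the majority training class.
    leaf = Counter(e[target] for e in train).most_common(1)[0][0]
    if error_rate(leaf, validation, target) <= error_rate(pruned_tree, validation, target):
        return leaf
    return pruned_tree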