Decision Trees. Robot Image Credit: Viktoriya Sukhanova 123RF.com


Decision Trees These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Function Approximation Problem Setting: Set of possible instances X; set of possible labels Y; unknown target function f : X → Y; set of function hypotheses H = {h | h : X → Y}. Input: training examples of the unknown target function f, {⟨x_i, y_i⟩}_{i=1}^n = {⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩}. Output: hypothesis h ∈ H that best approximates f. Based on slide by Tom Mitchell

Sample Dataset Columns denote features X_i; rows denote labeled instances ⟨x_i, y_i⟩; the class label denotes whether a tennis game was played.
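For concreteness, such a table can be written out directly as labeled rows. This is a minimal sketch in Python using PlayTennis-style attribute names (Outlook, Temperature, Humidity, Wind) and a few illustrative rows, not necessarily the exact data on the slide:

# A labeled dataset: each row is a feature vector x_i plus its class label y_i.
# Rows are illustrative (PlayTennis-style), not the slide's exact table.
dataset = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
]
features = [k for k in dataset[0] if k != "PlayTennis"]    # columns = features X_i
labels = [row["PlayTennis"] for row in dataset]            # class label y_i per row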

Decision Tree A possible decision tree for the data: each internal node tests one attribute X_i; each branch from a node selects one value for X_i; each leaf node predicts Y (or p(Y | x ∈ leaf)). Based on slide by Tom Mitchell

Decision Tree A possible decision tree for the data: what prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>? Based on slide by Tom Mitchell

Decision Tree If features are continuous, internal nodes can test the value of a feature against a threshold.
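One way to picture this structure in code: an internal node stores the attribute it tests (plus a threshold when the feature is continuous), each branch leads to a child subtree, and a leaf stores the prediction. A sketch of one possible representation, not code from the slides:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Node:
    # A decision-tree node: either a leaf (prediction is set) or an internal test.
    prediction: Optional[Any] = None      # leaf nodes: the predicted label Y
    attribute: Optional[str] = None       # internal nodes: which feature X_i to test
    threshold: Optional[float] = None     # continuous features: test X_i <= threshold
    children: Dict[Any, "Node"] = field(default_factory=dict)   # one child per branch

def predict(node: Node, x: Dict[str, Any]) -> Any:
    # Sort an instance x down the tree until a leaf assigns a label.
    while node.prediction is None:
        if node.threshold is None:        # categorical test: follow the branch matching x's value
            node = node.children[x[node.attribute]]
        else:                             # continuous test: compare against the threshold
            node = node.children["le" if x[node.attribute] <= node.threshold else "gt"]
    return node.prediction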

Decision Tree Learning Problem Setting: Set of possible instances X, where each instance x in X is a feature vector, e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>. Unknown target function f : X → Y, where Y is discrete-valued. Set of function hypotheses H = {h | h : X → Y}, where each hypothesis h is a decision tree: the tree sorts x to a leaf, which assigns y. Slide by Tom Mitchell

Stages of (Batch) Machine Learning Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^n; assumes each x_i ~ D(X) with y_i = f_target(x_i). Train the model: model ← classifier.train(X, Y) (the learner maps the training data X, Y to a model). Apply the model to new data: given a new unlabeled instance x ~ D(X), y_prediction ← model.predict(x).
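The train/apply stages above amount to a two-method interface. The stand-in below is a deliberately trivial "classifier" (it just memorizes the majority label), used only to illustrate the model ← classifier.train(X, Y) and y_prediction ← model.predict(x) pattern; it is not the decision-tree learner itself:

from collections import Counter

class MajorityClassifier:
    # Minimal stand-in for the train/apply interface; not the decision-tree learner.
    def train(self, X, Y):
        # "Train the model": memorize the most common label in Y.
        self.majority_ = Counter(Y).most_common(1)[0][0]
        return self
    def predict(self, x):
        # "Apply the model to new data": predict a label for an unseen instance x.
        return self.majority_

X = [{"Outlook": "Sunny"}, {"Outlook": "Rain"}, {"Outlook": "Overcast"}]
Y = ["No", "Yes", "Yes"]
model = MajorityClassifier().train(X, Y)              # model <- classifier.train(X, Y)
y_prediction = model.predict({"Outlook": "Sunny"})    # -> "Yes"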

Example Application: A Tree to Predict Caesarean Section Risk Based on Example by Tom Mitchell

Decision Tree Induced Partition [figure: a tree that splits first on Color (green / blue / red), then on Size (big / small) and Shape (square / round), with +/− class labels at the leaves]

Decision Tree Decision Boundary Decision trees divide the feature space into axis-parallel (hyper-)rectangles. Each rectangular region is labeled with one label, or a probability distribution over labels.

Expressiveness Decision trees can represent any boolean function of the input attributes. Truth table row → path to leaf. In the worst case, the tree will require exponentially many nodes.

Expressiveness Decision trees have a variable-sized hypothesis space: as the number of nodes (or depth) increases, the hypothesis space grows. Depth 1 ("decision stump"): can represent any boolean function of one feature. Depth 2: any boolean function of two features, and some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ x3)), etc. Based on slide by Pedro Domingos
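To see why depth 2 suffices for that formula, let the root test x1 and each branch test a single remaining feature. A quick check (my own illustration, not from the slides):

def depth2_tree(x1: bool, x2: bool, x3: bool) -> bool:
    # Depth-2 decision tree: root tests x1, each branch then tests one more feature.
    if x1:
        return x2      # left subtree: leaf labels determined by x2
    else:
        return x3      # right subtree: leaf labels determined by x3

# The tree computes exactly (x1 AND x2) OR ((NOT x1) AND x3):
assert all(
    depth2_tree(a, b, c) == ((a and b) or ((not a) and c))
    for a in (False, True) for b in (False, True) for c in (False, True)
)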

Another Example: Restaurant Domain (Russell & Norvig) Model a patron's decision of whether to wait for a table at a restaurant; ~7,000 possible cases.

A Decision Tree from Introspection Is this the best decision tree?

Preference bias: Ockham's Razor Principle stated by William of Ockham (1285-1347): "non sunt multiplicanda entia praeter necessitatem" (entities are not to be multiplied beyond necessity). AKA Occam's Razor, Law of Economy, or Law of Parsimony. Idea: the simplest consistent explanation is the best. Therefore, the smallest decision tree that correctly classifies all of the training examples is best. Finding the provably smallest decision tree is NP-hard, so instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small.

Basic Algorithm for Top-Down Induction of Decision Trees [ID3, C4.5 by Quinlan] node = root of decision tree. Main loop: 1. A ← the best decision attribute for the next node. 2. Assign A as the decision attribute for node. 3. For each value of A, create a new descendant of node. 4. Sort training examples to the leaf nodes. 5. If training examples are perfectly classified, stop; else, recurse over the new leaf nodes. How do we choose which attribute is best?

Choosing the Best Attribute Key problem: choosing which attribute to split a given set of examples. Some possibilities are: Random: select any attribute at random. Least-Values: choose the attribute with the smallest number of possible values. Most-Values: choose the attribute with the largest number of possible values. Max-Gain: choose the attribute that has the largest expected information gain, i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children. The ID3 algorithm uses the Max-Gain method of selecting the best attribute.

Choosing an Attribute Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative. Which split is more informative: Patrons? or Type? Based on Slide from M. desjardins & T. Finin

ID3-induced Decision Tree Based on Slide from M. desjardins & T. Finin

Compare the Two Decision Trees Based on Slide from M. desjardins & T. Finin

Information Gain Which test is more informative: a split over whether Balance exceeds 50K (branches: less than or equal to 50K / over 50K), or a split over whether the applicant is employed (branches: unemployed / employed)? Based on slide by Pedro Domingos

Information Gain Impurity/Entropy (informal): measures the level of impurity in a group of examples. Based on slide by Pedro Domingos

Impurity [figure comparing three groups: a very impure group, a less impure group, and a group with minimum impurity] Based on slide by Pedro Domingos

Entropy: a common way to measure impurity Entropy H(X) of a random variable X: H(X) = −Σ_{i=1}^{n} P(X=i) log₂ P(X=i), where n is the number of possible values for X. H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code). Why? Information theory: the most efficient code assigns −log₂ P(X=i) bits to encode the message X=i, so the expected number of bits to code one random X is Σ_i P(X=i) · (−log₂ P(X=i)) = H(X). Slide by Tom Mitchell

Example: Huffman code In 1952, MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme which is optimal in the case where all symbols' probabilities are integral powers of 1/2. A Huffman code can be built in the following manner: rank all symbols in order of probability of occurrence; successively combine the two symbols of the lowest probability to form a new composite symbol, eventually building a binary tree where each node is the probability of all nodes beneath it; trace a path to each leaf, noting the direction taken at each node. Based on Slide from M. desjardins & T. Finin

Huffman code example Symbol probabilities: A = .125, B = .125, C = .25, D = .5. Resulting code: A → 000 (length 3, prob 0.125, contribution 0.375 bits); B → 001 (length 3, prob 0.125, contribution 0.375 bits); C → 01 (length 2, prob 0.250, contribution 0.500 bits); D → 1 (length 1, prob 0.500, contribution 0.500 bits); average message length = 1.750 bits. If we use this code for many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75. Based on Slide from M. desjardins & T. Finin
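The greedy "merge the two least-probable symbols" procedure is easy to sketch with a heap. The snippet below is a minimal illustration, not the slide's construction; it reproduces the code lengths (3, 3, 2, 1) and the 1.75-bit average, though the particular bit strings it assigns may differ from those in the table above:

import heapq

def huffman_code(probs):
    # Repeatedly merge the two lowest-probability subtrees; prepend a bit on each merge.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                 # tie-breaker so the heap never compares dicts
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)                    # average message length = 1.75 bits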

Entropy 2-Class Cases: H(X) = −Σ_{i=1}^{n} P(x=i) log₂ P(x=i). What is the entropy of a group in which all examples belong to the same class? entropy = −1·log₂(1) = 0: minimum impurity, not a good training set for learning. What is the entropy of a group with 50% in either class? entropy = −0.5·log₂(0.5) − 0.5·log₂(0.5) = 1: maximum impurity, a good training set for learning. Based on slide by Pedro Domingos
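As a sanity check on those two cases, the entropy formula is only a few lines of Python (my own illustration):

from math import log2

def entropy(class_probs):
    # H(X) = -sum_i P(x=i) * log2 P(x=i); zero-probability terms contribute 0.
    return -sum(p * log2(p) for p in class_probs if p > 0)

print(entropy([1.0]))        # all examples in one class -> 0.0 (minimum impurity)
print(entropy([0.5, 0.5]))   # 50/50 split               -> 1.0 (maximum impurity)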

Sample Entropy Slide by Tom Mitchell

Information Gain We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how important a given attribute of the feature vectors is. We will use it to decide the ordering of attributes in the nodes of a decision tree. Based on slide by Pedro Domingos

From Entropy to Information Gain Entropy H(X) of a random variable X: H(X) = −Σ_x P(X=x) log₂ P(X=x). Specific conditional entropy H(X|Y=v) of X given Y=v: H(X|Y=v) = −Σ_x P(X=x | Y=v) log₂ P(X=x | Y=v). Conditional entropy H(X|Y) of X given Y: H(X|Y) = Σ_v P(Y=v) H(X|Y=v). Mutual information (aka information gain) of X and Y: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). Slide by Tom Mitchell

Information Gain Information Gain is the mutual information between input attribute A and target variable Y. Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A: Gain(S, A) = H_S(Y) − Σ_{v ∈ values(A)} (|S_v| / |S|) H_{S_v}(Y). Slide by Tom Mitchell
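Those definitions translate directly into code. The sketch below computes the empirical H(Y), H(Y|A), and Gain(S, A) = H(Y) − H(Y|A) from parallel lists of attribute values and labels; the function names are my own:

from math import log2
from collections import Counter

def entropy_of(labels):
    # Empirical entropy H(Y) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(attr_values, labels):
    # H(Y | A): weighted average of the label entropy within each attribute-value group.
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    return sum((len(g) / n) * entropy_of(g) for g in groups.values())

def information_gain(attr_values, labels):
    # Gain(S, A) = H(Y) - H(Y | A): expected reduction in entropy from splitting on A.
    return entropy_of(labels) - conditional_entropy(attr_values, labels)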

Calculating Information Gain Information Gain = entropy(parent) − [weighted average entropy(children)]. Entire population: 30 instances, 14 of one class and 16 of the other; parent entropy = −(14/30)log₂(14/30) − (16/30)log₂(16/30) = 0.996. First child (17 instances, split 13/4): entropy = −(13/17)log₂(13/17) − (4/17)log₂(4/17) = 0.787. Second child (13 instances, split 1/12): entropy = −(1/13)log₂(1/13) − (12/13)log₂(12/13) = 0.391. (Weighted) average entropy of children = (17/30)·0.787 + (13/30)·0.391 = 0.615. Information Gain = 0.996 − 0.615 = 0.38. Based on slide by Pedro Domingos
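Plugging the counts from that figure into the formula confirms the result (a quick check; small differences in the third decimal place come from rounding the intermediate values on the slide):

from math import log2

def H(pos, neg):
    # Two-class entropy from counts.
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

parent = H(14, 16)                                        # ≈ 0.996
children = (17 / 30) * H(13, 4) + (13 / 30) * H(1, 12)    # ≈ 0.615 (weighted average)
gain = parent - children
print(round(gain, 2))                                     # 0.38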

Entropy-Based Automatic Decision Tree Construction Training set X: x1 = (f11, f12, ..., f1m), x2 = (f21, f22, ..., f2m), ..., xn = (fn1, fn2, ..., fnm). At Node 1: what feature should be used, and what values? Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy. Based on slide by Pedro Domingos

Using Information Gain to Construct a Decision Tree Choose the attribute A with the highest information gain for the full training set X at the root of the tree. Construct child nodes for each value v1, ..., vk of A; each has an associated subset X_vi = {x ∈ X | value(A) = vi} of vectors in which A has that particular value. Repeat recursively (till when?). Disadvantage of information gain: it prefers attributes with a large number of values that split the data into small, pure subsets. Quinlan's gain ratio uses normalization to improve this. Based on slide by Pedro Domingos
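Quinlan's gain ratio divides the information gain by the "split information", i.e., the entropy of the attribute's own value distribution, so many-valued attributes are penalized. A small self-contained sketch (my own rendering of the idea, not C4.5 code):

from math import log2
from collections import Counter

def _entropy(items):
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(attr_values, labels):
    # Gain ratio = information gain / split information of attribute A.
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    gain = _entropy(labels) - sum((len(g) / n) * _entropy(g) for g in groups.values())
    split_info = _entropy(attr_values)     # entropy of A's own value distribution
    return gain / split_info if split_info > 0 else 0.0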


Decision Tree Applet http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output? ID3 performs a heuristic search through the space of decision trees; it stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data. Slide by Tom Mitchell

The ID3 algorithm builds a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the class attribute C, and a training set T of records:

function ID3(R: input attributes, C: class attribute, S: training set) returns decision tree;
  If S is empty, return a single node with value Failure;
  If every example in S has the same value for C, return a single node with that value;
  If R is empty, then return a single node with the most frequent of the values of C found in examples of S; # this case causes errors -- improperly classified records
  Let D be the attribute with the largest Gain(D, S) among attributes in R;
  Let {dj | j = 1, 2, ..., m} be the values of attribute D;
  Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting of records with value dj for attribute D;
  Return a tree with root labeled D and arcs labeled d1, ..., dm going to the trees ID3(R-{D}, C, S1), ..., ID3(R-{D}, C, Sm);

Based on Slide from M. desjardins & T. Finin
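A compact Python rendering of that pseudocode, under the same assumptions (categorical attributes, records as dicts, information gain as the splitting criterion); a sketch rather than an exact reproduction of Quinlan's ID3:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(records, attr, target):
    # Gain(D, S) = H(C) - weighted average of H(C) within each value of attribute D.
    labels = [r[target] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[target])
    return entropy(labels) - sum((len(g) / len(labels)) * entropy(g) for g in groups.values())

def id3(records, attributes, target):
    # Returns a nested dict {attribute: {value: subtree}} with class labels at the leaves.
    if not records:
        return "Failure"                                    # S is empty
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:
        return labels[0]                                    # every example has the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]         # no attributes left: majority class
    best = max(attributes, key=lambda a: gain(records, a, target))   # largest Gain(D, S)
    tree = {best: {}}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

Calling id3(dataset, features, "PlayTennis") on a table like the sample dataset sketched earlier returns a nested dict whose top-level key is the highest-gain attribute.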

How well does it work? Many case studies have shown that decision trees are at least as accurate as human experts. A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly. British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system. Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example. Based on Slide from M. desjardins & T. Finin