Decision Trees Lecture 12


Decision Trees, Lecture 12. David Sontag, New York University. Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore.

Machine Learning in the ER. [Figure: timeline of an emergency department visit, from triage (T = 0) to disposition about 2 hrs 30 min later, showing physician documentation, triage information (free text), MD comments (free text), specialist consults, lab results (continuous valued), and repeated vital signs (continuous values) measured every 30 s.]

Can we predict infection? [Figure: the same emergency-department timeline, highlighting that many crucial decisions about a patient's care are made from the physician documentation, triage information (free text), MD comments (free text), and specialist consults, alongside lab results (continuous valued) and repeated vital signs measured every 30 s.]

Can we predict infection? Previous automatic approaches were based on simple criteria: temperature < 96.8 F or > 100.4 F; heart rate > 90 beats/min; respiratory rate > 20 breaths/min. Too simplified: e.g., heart rate depends on age!

Can we predict infection? These are the attributes we have for each patient: temperature, heart rate (HR), respiratory rate (RR), age, acuity and pain level, diastolic and systolic blood pressure (DBP, SBP), and oxygen saturation (SaO2). We have these attributes + label (infection) for 200,000 patients! Let's learn to classify infection.

Predicting infection using decision trees

Hypotheses: decision trees f : X → Y. Each internal node tests an attribute x_i. Each branch assigns an attribute value x_i = v. Each leaf assigns a class y. To classify input x: traverse the tree from root to leaf and output the leaf's label y. [Figure: example tree predicting good/bad MPG; the root splits on Cylinders (3, 4, 5, 6, 8), the 4-cylinder branch splits on Maker (america, asia, europe), and the 8-cylinder branch splits on Horsepower (low, med, high).] Human interpretable!
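To make the root-to-leaf traversal concrete, here is a minimal Python sketch (not from the slides; the class names and the exact tree structure and leaf labels are illustrative):

# Minimal sketch of a decision tree and root-to-leaf classification.
class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute   # attribute tested at this internal node
        self.children = children     # dict: attribute value -> subtree

def classify(tree, x):
    """Traverse from root to leaf; x is a dict mapping attribute -> value."""
    while isinstance(tree, Node):
        tree = tree.children[x[tree.attribute]]
    return tree.label

# Illustrative tree on the MPG attributes (labels are assumptions, not the slides' exact tree).
tree = Node("cylinders", {
    3: Leaf("good"),
    4: Node("maker", {"america": Leaf("bad"), "asia": Leaf("good"), "europe": Leaf("good")}),
    5: Leaf("bad"),
    6: Leaf("bad"),
    8: Node("horsepower", {"low": Leaf("good"), "med": Leaf("bad"), "high": Leaf("bad")}),
})
print(classify(tree, {"cylinders": 4, "maker": "asia"}))   # -> good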

How many possible hypotheses? What functions can be represented? [Figure: the hypothesis space, illustrated with the same Cylinders / Maker / Horsepower tree.]

What functions can be represented? Decision trees can represent any function of the input attributes! For example, A XOR B: (A=F, B=F) → F; (A=F, B=T) → T; (A=T, B=F) → T; (A=T, B=T) → F. For Boolean functions, the path to a leaf gives a truth table row. (Figure from Stuart Russell.) But this could require exponentially many nodes. [Figure: the Cylinders / Maker / Horsepower tree again; the good-MPG concept it encodes can be written as cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)).]

Are all decision trees equal? Many trees can represent the same concept, but not all trees will have the same size! E.g., φ = (A ∧ B) ∨ (¬A ∧ C), i.e., ((A and B) or (not A and C)). [Figure: two trees for φ: one splits on A at the root and then needs only one more test (B or C) on each branch; the other splits on B first and must repeat tests on A and C in its subtrees.] Which tree do we prefer?

Learning decision trees is hard!!! Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]. So we resort to a greedy heuristic: start from an empty decision tree, split on the next best attribute (feature), and recurse.

A Decision Stump

Key idea: greedily learn trees using recursion. Take the original dataset and partition it according to the value of the attribute we split on: records in which cylinders = 4, records in which cylinders = 5, records in which cylinders = 6, records in which cylinders = 8.

Recursive step: build a tree from each partition, i.e., from the records in which cylinders = 4, the records in which cylinders = 5, the records in which cylinders = 6, and the records in which cylinders = 8.

Second level of tree: recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)

A full tree

Splitting: choosing a good attribute. Would we prefer to split on X1 or X2? Splitting on X1: the X1 = t branch has Y=t: 4, Y=f: 0; the X1 = f branch has Y=t: 1, Y=f: 3. Splitting on X2: the X2 = t branch has Y=t: 3, Y=f: 1; the X2 = f branch has Y=t: 2, Y=f: 2. Idea: use counts at leaves to define probability distributions, so we can measure uncertainty! (Dataset: eight records of (X1, X2, Y): (T,T,T), (T,F,T), (T,T,T), (T,F,T), (F,T,T), (F,F,F), (F,T,F), (F,F,F).)

Measuring uncertainty. A split is good if we are more certain about the classification after the split. Deterministic is good (all true or all false); a uniform distribution is bad. What about distributions in between? Compare P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8 with the uniform P(Y=A) = P(Y=B) = P(Y=C) = P(Y=D) = 1/4.
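As a quick worked check (using the entropy formula defined on the next slide), the skewed distribution has 1.75 bits of entropy while the uniform one has the maximum 2 bits:

from math import log2

def entropy(ps):
    # Entropy in bits of a discrete distribution given as a list of probabilities.
    return -sum(p * log2(p) for p in ps if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))     # 1.75 bits: more certain
print(entropy([1/4, 1/4, 1/4, 1/4]))     # 2.0 bits: maximally uncertain over 4 values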

Entropy. The entropy H(Y) of a random variable Y is H(Y) = − Σ_i P(Y = y_i) log2 P(Y = y_i). More uncertainty, more entropy! Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code). [Figure: entropy of a coin flip as a function of the probability of heads.]

High, Low Entropy. High entropy: Y is from a uniform-like distribution; the histogram is flat; values sampled from it are less predictable. Low entropy: Y is from a varied (peaks and valleys) distribution; the histogram has many lows and highs; values sampled from it are more predictable. (Slide from Vibhav Gogate.)

Entropy example. For the running dataset below, P(Y=t) = 5/6 and P(Y=f) = 1/6, so H(Y) = −5/6 log2(5/6) − 1/6 log2(1/6) ≈ 0.65. (Dataset: six records of (X1, X2, Y): (T,T,T), (T,F,T), (T,T,T), (T,F,T), (F,T,T), (F,F,F).) [Figure: entropy of a coin flip vs. probability of heads.]

Conditional entropy. The conditional entropy H(Y | X) of a random variable Y conditioned on a random variable X is H(Y | X) = Σ_x P(X = x) H(Y | X = x). Example: P(X1 = t) = 4/6 and P(X1 = f) = 2/6; the X1 = t branch has Y=t: 4, Y=f: 0 and the X1 = f branch has Y=t: 1, Y=f: 1, so H(Y | X1) = −4/6 (1 log2 1 + 0 log2 0) − 2/6 (1/2 log2 1/2 + 1/2 log2 1/2) = 2/6 ≈ 0.33.

Information gain. Decrease in entropy (uncertainty) after splitting. In our running example: IG(X1) = H(Y) − H(Y | X1) = 0.65 − 0.33 = 0.32. Since IG(X1) > 0, we prefer the split!
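A minimal sketch reproducing these numbers (dataset taken from the running example above; function names are my own):

from math import log2
from collections import Counter

# Running example: six (X1, X2, Y) records.
data = [(True, True, True), (True, False, True), (True, True, True),
        (True, False, True), (False, True, True), (False, False, False)]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cond_entropy(data, i):
    """H(Y | X_i): weighted entropy of Y within each value of attribute i."""
    n = len(data)
    h = 0.0
    for v in set(row[i] for row in data):
        subset = [row[-1] for row in data if row[i] == v]
        h += (len(subset) / n) * entropy(subset)
    return h

y = [row[-1] for row in data]
print(entropy(y))                            # H(Y)     ≈ 0.65
print(cond_entropy(data, 0))                 # H(Y|X1)  ≈ 0.33
print(entropy(y) - cond_entropy(data, 0))    # IG(X1)   ≈ 0.32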

Learning decision trees. Start from an empty decision tree. Split on the next best attribute (feature), using, for example, information gain to select it: choose arg max_i IG(X_i) = arg max_i H(Y) − H(Y | X_i). Recurse.
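Putting the pieces together, here is a minimal ID3-style sketch of this greedy recursion (categorical attributes only, no pruning; an illustration of the procedure described above, not the slides' code):

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # IG(attr) = H(Y) - H(Y | attr), with rows given as dicts attribute -> value.
    n = len(labels)
    h = 0.0
    for v in set(r[attr] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        h += (len(sub) / n) * entropy(sub)
    return entropy(labels) - h

def id3(rows, labels, attrs):
    # Base case one: all records have the same output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: no attributes left to split on; predict the majority label.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        tree[v] = id3(sub_rows, sub_labels, attrs - {best})
    return (best, tree)

# Example on the running six-record dataset (1 = True, 0 = False):
records = [(1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0)]
rows = [{"x1": a, "x2": b} for a, b, _ in records]
labels = [y for _, _, y in records]
print(id3(rows, labels, {"x1", "x2"}))
# e.g. -> ('x1', {1: 1, 0: ('x2', {1: 1, 0: 0})})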

When to stop? First split looks good! But, when do we stop?

Base Case One: Don't split a node if all matching records have the same output value.

Base Case Two: Don't split a node if the data points are identical on the remaining attributes.

Base Cases: An Idea. Base Case One: if all records in the current data subset have the same output, then don't recurse. Base Case Two: if all records have exactly the same set of input attributes, then don't recurse. Proposed Base Case 3: if all attributes have small information gain, then don't recurse. This is not a good idea.

The problem with proposed case 3: consider y = a XOR b. The information gains for a and b are both zero, so proposed base case 3 would refuse to split at the root.

If we omit proposed case 3 and keep splitting on y = a XOR b, the resulting decision tree splits on one attribute and then the other, classifying every record correctly. Instead of stopping early, perform pruning after building the tree.
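A quick sketch verifying the point above on a hypothetical four-row dataset: each attribute alone has zero information gain for y = a XOR b, even though the two-level tree classifies perfectly.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# y = a XOR b on all four input combinations.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
y = [r[2] for r in rows]

for i, name in [(0, "a"), (1, "b")]:
    # Conditional entropy of y given this attribute alone.
    h = sum(len([r[2] for r in rows if r[i] == v]) / 4 *
            entropy([r[2] for r in rows if r[i] == v])
            for v in (0, 1))
    print(f"IG({name}) = {entropy(y) - h}")   # both print 0.0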

Decision trees will overfit

Decision trees will overfit. Standard decision trees have no learning bias: training set error is always zero (if there is no label noise), so there is lots of variance. We must introduce some bias towards simpler trees. There are many strategies for picking simpler trees, e.g., fixed depth, fixed number of leaves, or random forests; a small sketch of depth limiting follows.
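For example, with scikit-learn (if available) one could compare an unrestricted tree to a depth-limited one; the dataset and parameter values here are illustrative, not from the slides:

# Illustrative sketch: limiting tree size to reduce variance (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No bias: usually fits the training set perfectly, with higher variance.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Biased toward simpler trees: fixed depth and a minimum leaf size.
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                               random_state=0).fit(X_tr, y_tr)

print(full.score(X_tr, y_tr), full.score(X_te, y_te))    # train ~1.0, lower test accuracy
print(small.score(X_tr, y_tr), small.score(X_te, y_te))  # worse on train, often better on test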

Real-valued inputs. What should we do if some of the inputs are real-valued? Infinite number of possible split values!!!

One branch for each numeric value? Hopeless: a hypothesis with such a high branching factor will shatter any dataset and overfit.

Threshold splits. Binary tree: split on attribute X at value t. One branch: X < t; other branch: X ≥ t. This requires only a small change: allow repeated splits on the same variable along a path. [Figure: a tree that splits on Year < 78 vs. ≥ 78 and then splits again on Year < 70 vs. ≥ 70 along one path, with good/bad leaves.]

The set of possible thresholds. Binary tree, split on attribute X: one branch X < t, the other branch X ≥ t. Search through possible values of t. Seems hard!!! But only a finite number of t's are important: sort the data according to X into {x_1, ..., x_m} and consider split points of the form x_i + (x_{i+1} − x_i)/2. Moreover, only splits between examples of different classes matter! (Figures from Stuart Russell.)

Picking the best threshold. Suppose X is real-valued with threshold t. We want IG(Y | X:t), the information gain for Y when testing whether X is greater than or less than t. Define: H(Y | X:t) = P(X < t) H(Y | X < t) + P(X ≥ t) H(Y | X ≥ t), IG(Y | X:t) = H(Y) − H(Y | X:t), and IG*(Y | X) = max_t IG(Y | X:t). Use IG*(Y | X) for continuous variables.
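A sketch of the threshold search for one real-valued attribute, following the definitions above (candidate thresholds are midpoints between sorted examples of different classes; the example values at the end are hypothetical):

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t, IG*) maximizing IG(Y | X:t) over candidate thresholds."""
    pairs = sorted(zip(xs, ys))
    h_y = entropy(ys)
    best_t, best_ig = None, -1.0
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:
            continue                       # only splits between different classes matter
        t = x1 + (x2 - x1) / 2             # midpoint split point
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        h = len(left) / len(ys) * entropy(left) + len(right) / len(ys) * entropy(right)
        if h_y - h > best_ig:
            best_t, best_ig = t, h_y - h
    return best_t, best_ig

# Hypothetical example: horsepower values with good (1) / bad (0) MPG labels.
print(best_threshold([60, 70, 85, 95, 110, 130], [1, 1, 1, 0, 0, 0]))   # -> (90.0, 1.0)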

Example with MPG

Example tree for our continuous dataset

What you need to know about decision trees. Decision trees are one of the most popular ML tools: easy to understand, implement, and use, and computationally cheap (to solve heuristically). Use information gain to select attributes (ID3, C4.5, ...). Presented here for classification, but they can be used for regression and density estimation too. Decision trees will overfit!!! You must use tricks to find simple trees, e.g., fixed depth / early stopping, pruning, or ensembles of different trees (random forests).
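As one illustration of the ensemble route mentioned above, a scikit-learn random forest sketch (dataset and parameters are illustrative):

# Illustrative sketch: an ensemble of trees (random forest) instead of one pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())   # typically higher than a single unpruned tree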