ML (cont.): DECISION TREES

ML (cont.): DECISION TREES
CS540, Bryan R Gibson, University of Wisconsin-Madison
Slides adapted from those used by Prof. Jerry Zhu, CS540-1. Some slides from Andrew Moore (http://www.cs.cmu.edu/~awm/tutorials) and Chuck Dyer.
1 / 45

x: Review
The input (aka example, point, instance, item) is usually represented by a feature vector, composed of features (aka attributes).
For decision trees (DTs), we focus on discrete features (continuous features are possible; see the end of the slides).
2 / 45

Example: Mushrooms 3 / 45

Example: Mushroom features
1. cap-shape: b=bell, c=conical, x=flat, k=knobbed, s=sunken
2. cap-surface: f=fibrous, g=grooves, y=scaly, s=smooth
3. cap-color: n=brown, b=buff, c=cinnamon, g=gray, r=green, p=pink, u=purple, e=red, w=white, y=yellow
4. bruises?: t=bruises, f=no
5. odor: a=almond, l=anise, c=creosote, y=fishy, f=foul, m=musty, n=none, p=pungent, s=spicy
6. gill-attachment: a=attached, d=descending, f=free, n=notched
7. ...
For example: x1 = [b,g,r,t,f,n,...]
4 / 45

y: Review
The output (aka label, target, goal). It can be:
continuous: regression (e.g. population prediction)
discrete: classification (e.g. is mushroom x edible or poisonous?)
5 / 45

Example: Two Mushrooms
x1 = [x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u], y = p
x2 = [x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g], y = e
1. cap-shape: b=bell, c=conical, x=flat, k=knobbed, s=sunken
2. cap-surface: f=fibrous, g=grooves, y=scaly, s=smooth
3. cap-color: n=brown, b=buff, c=cinnamon, g=gray, r=green, p=pink, u=purple, e=red, w=white, y=yellow
4. bruises?: t=bruises, f=no
5. ...
6 / 45
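
Purely as an illustration (my own encoding, not from the slides), these two mushrooms could be written as (feature_dict, label) pairs, the format used by the Python sketches later in this document; only the first few attributes are named, the rest are omitted.

```python
# Hypothetical encoding of the two mushroom examples above.
x1 = {'cap-shape': 'x', 'cap-surface': 's', 'cap-color': 'n', 'bruises?': 't', 'odor': 'p'}
x2 = {'cap-shape': 'x', 'cap-surface': 's', 'cap-color': 'y', 'bruises?': 't', 'odor': 'a'}
training_examples = [(x1, 'p'), (x2, 'e')]   # p = poisonous, e = edible
```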

Supervised Learning: Review
Training set: n (example, label) pairs: (x_1, y_1), ..., (x_n, y_n)
A function (aka hypothesis) f : x -> y
Hypothesis space (a subset of a function family): e.g. the set of dth-order polynomials
Goal: find the best function in the hypothesis space, one that generalizes well
Performance measure: MSE for regression; accuracy or error rate for classification
7 / 45

Evaluating Classifiers: Review
During training: train the classifier from a training set (x_1, y_1), ..., (x_n, y_n).
During testing: for new test data x_{n+1}, ..., x_{n+m}, the classifier generates predicted labels ŷ_{n+1}, ..., ŷ_{n+m}.
Test set accuracy: we need to know the true test labels y_{n+1}, ..., y_{n+m}.
Test set accuracy: acc = (1/m) Σ_{i=n+1}^{n+m} 1{y_i = ŷ_i}
Test set error rate: 1 - acc
8 / 45
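
As a quick illustration (not from the original slides), here is a minimal Python sketch of the accuracy and error-rate computation above; the two label lists are made-up placeholders.

```python
# Minimal sketch: test-set accuracy and error rate for a classifier.
def accuracy(true_labels, predicted_labels):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y == y_hat)
    return correct / len(true_labels)

true_labels      = ['p', 'e', 'e', 'p', 'e']   # hypothetical y_{n+1..n+m}
predicted_labels = ['p', 'e', 'p', 'p', 'e']   # hypothetical yhat_{n+1..n+m}
acc = accuracy(true_labels, predicted_labels)
print(f"accuracy = {acc:.2f}, error rate = {1 - acc:.2f}")   # accuracy = 0.80, error rate = 0.20
```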

Decision Trees Another kind of classifier (SL) The tree Algorithm Mutual Information of questions Overfitting and Pruning Extension: real-valued features, tree rules, pro/con 9 / 45

Decision Trees (cont.)
A decision tree has 2 kinds of nodes:
leaf node: has a class label, determined by majority vote of the training examples reaching that leaf
internal node: asks a question on a feature and branches out according to the answers
10 / 45

Automobile Miles-Per-Gallon Prediction

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
...   ...        ...           ...         ...     ...           ...        ...
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
11 / 45

A Small Decision Tree
Internal node question: number of cylinders? (counts shown as: number bad / number good)
root: 22 / 18
  cylinders=3: 0 / 0  -> Predict Bad
  cylinders=4: 4 / 17 -> Predict Good
  cylinders=5: 1 / 0  -> Predict Bad
  cylinders=6: 8 / 0  -> Predict Bad
  cylinders=8: 9 / 1  -> Predict Bad
Leaves: classify by majority vote
12 / 45

A Bigger DT (counts shown as: number bad / number good)
root: 22 / 18
  cylinders=3: 0 / 0  -> Predict Bad
  cylinders=4: 4 / 17 -> where's it made? (split on maker)
    maker=am:  0 / 10 -> Predict Good
    maker=as:  2 / 5  -> Predict Good
    maker=eur: 2 / 2  -> Predict Bad
  cylinders=5: 1 / 0  -> Predict Bad
  cylinders=6: 8 / 0  -> Predict Bad
  cylinders=8: 9 / 1  -> what's the horsepower? (split on horsepower)
    hrspwr=l: 0 / 0 -> Predict Bad
    hrspwr=m: 0 / 1 -> Predict Good
    hrspwr=h: 9 / 0 -> Predict Bad
13 / 45

The Full DT
(Figure: the tree from the previous slide grown out until every branch stops, with bad/good counts at each node and further splits on horsepower, acceleration, and model year.)
When does a node stop splitting?
1. Don't split: all instances reaching the node have the same label (the node is pure).
2. Can't split: no more questions remain (not true here).
14 / 45

The Decision Tree Algorithm

root.buildtree(examples, questions, default)
/* examples:  a list of training examples
   questions: set of candidate questions (e.g. "what is the value of feature i?")
   default:   default label prediction (e.g. overall majority vote) */
IF empty(examples) THEN this.setprediction(default); RETURN
IF (examples all have the same label y) THEN this.setprediction(y); RETURN
IF empty(questions) THEN this.setprediction(majority vote of examples); RETURN
q = bestquestion(examples, questions)
FOR j = 1 .. (number of answers to q)
    node = new Node; this.addchild(node)
    node.buildtree({examples with q = answer j}, questions \ {q}, default)
15 / 45
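
The following is a hedged Python sketch of the pseudocode above, not the course's reference implementation. It represents examples as (feature_dict, label) pairs and takes the question-scoring function (best_question, e.g. the information-gain scorer defined later in these slides) as a parameter.

```python
from collections import Counter

def majority_vote(examples):
    """Most common label among (feature_dict, label) pairs."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def build_tree(examples, questions, default, best_question):
    """examples: list of (feature_dict, label); questions: set of feature names;
    default: fallback label; best_question: scoring function, e.g. information gain."""
    if not examples:
        return {'prediction': default}                 # no examples: predict default
    labels = {y for _, y in examples}
    if len(labels) == 1:                               # pure node: stop splitting
        return {'prediction': labels.pop()}
    if not questions:                                  # no questions left: majority vote
        return {'prediction': majority_vote(examples)}
    q = best_question(examples, questions)
    children = {}
    for answer in {x[q] for x, _ in examples}:         # one branch per observed answer
        subset = [(x, y) for x, y in examples if x[q] == answer]
        # default is passed down unchanged, as in the pseudocode above
        children[answer] = build_tree(subset, questions - {q}, default, best_question)
    return {'question': q, 'children': children}
```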

The Best Question
What we want: pure leaf nodes, where pure means all examples have (almost) the same label y.
A good question splits examples into pure child nodes.
How do we measure how much purity results from a question? One possibility (called Max-Gain in the book): Mutual Information, or Information Gain (a quantity from information theory).
But this is a measure of the change in purity; we still need a measure of purity/impurity first...
16 / 45

The Best Question: Entropy
Imagine that at a node there are n = n_1 + ... + n_k examples:
n_1 examples have label y_1, n_2 examples have label y_2, ..., n_k examples have label y_k.
What is the impurity of the node?
Imagine this game: I put all of the examples in a bag, then pull one out at random. What is the probability that the example has label y_i? It is p_i. But how do we calculate p_i?
17 / 45

The Best Question: Entropy (cont.)
We estimate p_i from our examples:
with probability p_1 = n_1/n, the example has label y_1
with probability p_2 = n_2/n, the example has label y_2
...
with probability p_k = n_k/n, the example has label y_k
so that p_1 + p_2 + ... + p_k = 1.
The outcome of the draw is a random variable y with probability distribution (p_1, p_2, ..., p_k).
Asking "what is the impurity of the node?" is the same as asking "what is the uncertainty of y in a random drawing?"
18 / 45

The Best Question: Entropy Defined
H(Y) = - Σ_{i=1}^{k} Pr(Y = y_i) log2 Pr(Y = y_i)
     = - Σ_{i=1}^{k} p_i log2 p_i
Interpretation: entropy H is the number of yes/no questions (bits) needed on average to pin down the value of y in a random drawing.
19 / 45
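
A small Python sketch of this definition (illustration only); it estimates the p_i from label counts exactly as described above.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i, with p_i estimated from label counts."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

print(entropy(['+', '+', '+', '-', '-', '-']))   # 1.0 (a fair coin needs 1 bit)
print(entropy(['+', '+', '+', '+']))             # 0.0 (pure node, no uncertainty)
```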

The Best Question: Entropy, some examples 20 / 45

The Best Question: Conditional Entropy
H(Y | X = v) = - Σ_{i=1}^{k} Pr(Y = y_i | X = v) log2 Pr(Y = y_i | X = v)
H(Y | X) = Σ_{v in values of X} Pr(X = v) H(Y | X = v)
Y: the label
X: a question (e.g. a feature)
v: an answer to the question
Pr(Y | X = v): conditional probability
21 / 45

The Best Question: Information Gain
Information Gain, or Mutual Information: I(Y; X) = H(Y) - H(Y | X)
Choose the question (feature) X which maximizes I(Y; X): argmax_X I(Y; X)
22 / 45
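
Continuing the sketch (my illustration, not the course code), conditional entropy and information gain can be computed as below, reusing the entropy() function from the previous snippet and the (feature_dict, label) example format assumed earlier; best_question is the scorer the build_tree sketch expects.

```python
def conditional_entropy(examples, feature):
    """H(Y | X) = sum_v Pr(X=v) * H(Y | X=v)."""
    n = len(examples)
    total = 0.0
    for v in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def information_gain(examples, feature):
    """I(Y; X) = H(Y) - H(Y | X)."""
    return entropy([y for _, y in examples]) - conditional_entropy(examples, feature)

def best_question(examples, questions):
    """Pick the feature with maximum information gain."""
    return max(questions, key=lambda q: information_gain(examples, q))
```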

The Best Question: Example
Features: color, shape, size. What's the best question at the root?
(Figure: the training examples, drawn as + and - items.)
23 / 45

The Best Question: Example, Information Gain

Example  Color  Shape   Size   Class
1        Red    Square  Big    +
2        Blue   Square  Big    +
3        Red    Circle  Big    +
4        Red    Circle  Small  -
5        Green  Square  Small  -
6        Green  Square  Big    -

H(class) = H(3/6, 3/6) = 1
H(class | color) = (3/6) H(2/3, 1/3) + (1/6) H(1, 0) + (2/6) H(0, 1) ≈ 0.46
(3 out of 6 are red, 2 of those are +; 1 out of 6 is blue, and it is +; 2 out of 6 are green, and both are -)
I(class; color) = H(class) - H(class | color) = 0.54 bits
24 / 45

The Best Question: Example, Information Gain (cont.)
(Same six examples as above.)
H(class) = 1, I(class; color) = 0.54 bits
H(class | shape) = (4/6) H(1/2, 1/2) + (2/6) H(1/2, 1/2) = 1
I(class; shape) = H(class) - H(class | shape) = 0 bits
Shape tells us nothing about the class!
25 / 45

The Best Question: Example, Information Gain (cont.)
(Same six examples as above.)
H(class) = 1, I(class; color) = 0.54 bits, I(class; shape) = 0 bits
H(class | size) = (4/6) H(3/4, 1/4) + (2/6) H(0, 1) ≈ 0.54
I(class; size) = H(class) - H(class | size) ≈ 0.46 bits
26 / 45

The Best Question: Example, Information Gain (cont.)
(Same six examples as above.)
H(class) = 1
I(class; color) = 0.54 bits
I(class; shape) = 0 bits
I(class; size) = 0.46 bits
We select color as the question at the root!
27 / 45
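
As a sanity check, the toy table above can be fed to the information_gain sketch from a few slides back; the printed values should match the 0.54 / 0 / 0.46 bits computed here (the dictionary encoding of the table is mine).

```python
toy = [
    ({'color': 'Red',   'shape': 'Square', 'size': 'Big'},   '+'),
    ({'color': 'Blue',  'shape': 'Square', 'size': 'Big'},   '+'),
    ({'color': 'Red',   'shape': 'Circle', 'size': 'Big'},   '+'),
    ({'color': 'Red',   'shape': 'Circle', 'size': 'Small'}, '-'),
    ({'color': 'Green', 'shape': 'Square', 'size': 'Small'}, '-'),
    ({'color': 'Green', 'shape': 'Square', 'size': 'Big'},   '-'),
]
for feature in ('color', 'shape', 'size'):
    print(feature, round(information_gain(toy, feature), 2))
# color 0.54, shape 0.0, size 0.46  -> color is chosen at the root
```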

Overfitting Example: Predicting US Population
Given some training data (n = 11), what will the population be in 2020?

x = year   y = millions
1900       75.995
1910       91.972
1920       105.71
1930       123.2
1940       131.67
1950       150.7
1960       179.32
1970       203.21
1980       226.51
1990       249.63
2000       281.42
28 / 45

Overfitting Example: Regression - Polynomial Fit
The degree d (the complexity of the model) is important:
f(x) = c_d x^d + c_{d-1} x^{d-1} + ... + c_1 x + c_0
We want to fit (learn) the coefficients c_d, ..., c_0 to minimize the Mean Squared Error (MSE) on the training data:
MSE = (1/n) Σ_{i=1}^{n} (y_i - f(x_i))^2
Matlab demo: USpopulation.m
29 / 45
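
The original demo is the Matlab script USpopulation.m; the snippet below is a rough NumPy stand-in (my own sketch, not the course code) showing how increasing the degree drives the training MSE down while the extrapolation to 2020 becomes unstable.

```python
import numpy as np

years = np.array([1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000])
pops  = np.array([75.995, 91.972, 105.71, 123.2, 131.67, 150.7,
                  179.32, 203.21, 226.51, 249.63, 281.42])

x = (years - 1950) / 50.0                       # rescale for numerical stability
for d in (1, 2, 5, 10):
    coeffs = np.polyfit(x, pops, d)             # least-squares fit of degree d
    mse = np.mean((pops - np.polyval(coeffs, x)) ** 2)
    extrap = np.polyval(coeffs, (2020 - 1950) / 50.0)
    print(f"degree {d:2d}: train MSE = {mse:9.4f}, predicted 2020 pop = {extrap:10.1f}")
```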

Overfitting Example: Regression - Polynomial Fit (cont.)
As d increases, the MSE on the training set improves, but prediction outside the data worsens.

degree   MSE
0        4181.4526
1        79.6005
2        9.3469
3        9.2896
4        7.4201
5        5.3101
6        2.4932
7        2.2783
8        1.2580
9        0.0014
10       0.0000
30 / 45

Overfitting a Decision Tree
Construct a special training set:
each feature vector consists of five bits a, b, c, d, e
create every possible configuration (32 configurations)
set y = e, then randomly flip 25% of the y labels

32 records:
a b c d e  y
0 0 0 0 0  0
0 0 0 0 1  0
0 0 0 1 0  0
0 0 0 1 1  1
0 0 1 0 0  1
: : : : :  :
1 1 1 1 1  1
31 / 45

Overfitting a Decision Tree (cont.)
The test set is constructed similarly: y = e, then a different 25% of the labels are flipped (so y ≠ e for those records).
The corruptions in training and test are independent.
The training and test sets have many identical (feature, label) pairs, but some labels differ between training and test.
32 / 45
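
One possible way to generate this synthetic experiment in Python (my sketch; the seeds and helper name are arbitrary): enumerate all 32 bit configurations, set y = e, then flip an independent 25% of the labels in the train and test copies.

```python
import itertools
import random

def make_dataset(flip_fraction=0.25, seed=0):
    """All 32 five-bit configurations, label y = e, with a random 25% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for a, b, c, d, e in itertools.product([0, 1], repeat=5):
        data.append(({'a': a, 'b': b, 'c': c, 'd': d, 'e': e}, e))
    for i in rng.sample(range(len(data)), k=int(flip_fraction * len(data))):
        x, y = data[i]
        data[i] = (x, 1 - y)                  # corrupt the label
    return data

train = make_dataset(seed=1)                  # 25% of training labels corrupted
test  = make_dataset(seed=2)                  # a different, independent 25% corrupted
```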

Overfitting a Decision Tree (cont.)
Building the full tree on the training set:
(Figure: the full tree, splitting first on e, then on a, and so on down through the remaining bits.)
Training set accuracy = 100%
Remember: 25% of the training labels are corrupted (y ≠ e).
33 / 45

Overfitting a Decision Tree (cont.)
Then classify the test set with the learned tree.
(Figure: the same full tree as on the previous slide.)
A different 25% of the test labels are corrupted, independent of the training data.
34 / 45

Overfitting a Decision Tree (cont.)
On average:
3/4 of the training data is uncorrupted;
  3/4 of these are also uncorrupted in the test set: correct predicted labels
  1/4 of these are corrupted in the test set: incorrect predictions
1/4 of the training data is corrupted;
  3/4 of these are uncorrupted in the test set: incorrect predictions
  1/4 of these are also corrupted in the test set: correct predictions
Test set accuracy = (3/4)(3/4) + (1/4)(1/4) = 5/8 = 62.5%
35 / 45

Overfitting a Decision Tree (cont.)
But what if we knew that a, b, c, d are irrelevant features and didn't use them in the tree? Pretend those columns don't exist.
(Same 32-record table as before, with columns a, b, c, d ignored.)
36 / 45

Overfitting a Decision Tree (cont.)
The tree would be a single split on e:
  e=0: in the training data, about 3/4 of the y's are 0 here; majority vote predicts y=0
  e=1: in the training data, about 3/4 of the y's are 1 here; majority vote predicts y=1
In the test data, 1/4 of the y's differ from e. Test accuracy = ?
Test accuracy = 3/4 = 75% (better than the full tree's test accuracy of 62.5%!)
37 / 45

Overfitting a Decision Tree (cont.)
In the full tree, we overfit by learning non-existent relations (noise).
(Figure: the full tree again, splitting on e and then on the irrelevant bits a, b, c, d.)
38 / 45

Avoiding Overfitting: Pruning How to prune with a tuning set: 1. Randomly split data into TRAIN and TUNE sets (say 70% and 30% for instance) 2. Build a full tree using only TRAIN 3. Prune the tree using the TUNE set (next page shows a greedy version) 39 / 45

Greedy Pruning Algorithm
Prune(tree T, TUNE set):
1. Compute T's accuracy on TUNE; call it A(T).
2. For every internal node N in T:
   2.1 Create a new tree T_N: copy T, but prune (delete) the subtree under N.
   2.2 N becomes a leaf node in T_N; its label is the majority vote of the TRAIN examples reaching N.
   2.3 Calculate A(T_N) = T_N's accuracy on TUNE.
3. Find T*, the tree (among the T_N's and T itself) with the largest A().
4. Set T <- T* (pruning!).
5. Repeat from step 1 until no more improvement is available (A() does not increase).
6. Return T (the fully pruned tree).
40 / 45
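
Below is one possible Python rendering of this greedy pruning loop for the dict-based trees sketched earlier. It assumes each internal node also stores a 'majority' field holding the majority TRAIN label of the examples reaching it (my extension, not in the slides), and it breaks accuracy ties in favor of the smaller (pruned) tree.

```python
import copy

def classify(tree, x):
    """Follow questions to a leaf; unseen answers fall back to the node's majority label."""
    while 'prediction' not in tree:
        child = tree['children'].get(x[tree['question']])
        if child is None:
            return tree['majority']
        tree = child
    return tree['prediction']

def tune_accuracy(tree, tune):
    return sum(classify(tree, x) == y for x, y in tune) / len(tune)

def internal_node_paths(tree, path=()):
    """Yield the path (sequence of answers) from the root to every internal node."""
    if 'children' in tree:
        yield path
        for answer, child in tree['children'].items():
            yield from internal_node_paths(child, path + (answer,))

def pruned_copy(tree, path):
    """Copy the tree, turning the internal node at `path` into a leaf predicting its majority label."""
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for answer in path:
        node = node['children'][answer]
    label = node['majority']
    node.clear()
    node['prediction'] = label
    return new_tree

def prune(tree, tune):
    """Greedily prune while TUNE accuracy does not decrease."""
    current, current_acc = tree, tune_accuracy(tree, tune)
    while True:
        candidates = [pruned_copy(current, p) for p in internal_node_paths(current)]
        if not candidates:
            return current                          # nothing left to prune
        best = max(candidates, key=lambda t: tune_accuracy(t, tune))
        best_acc = tune_accuracy(best, tune)
        if best_acc < current_acc:                  # every prune hurts: stop
            return current
        current, current_acc = best, best_acc       # prefer the smaller tree on ties
```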

Real-valued Features
What if some (or all) of the features x_1, x_2, ..., x_k are real-valued?
Example: x_1 = height (in inches)
Idea 1: branch on each possible numerical value. (This fragments the training data and is prone to overfitting, with some caveats.)
Idea 2: use questions of the form (x_i > t?), where t is a threshold. There are fast ways to try all relevant thresholds t.
H(y | x_i > t?) = p(x_i > t) H(y | x_i > t) + p(x_i ≤ t) H(y | x_i ≤ t)
I(y; x_i > t?) = H(y) - H(y | x_i > t?)
41 / 45
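
A small sketch of Idea 2 (my illustration, reusing the entropy() helper from earlier): score candidate thresholds at the midpoints between consecutive sorted feature values and keep the one with the largest information gain.

```python
def best_threshold(values, labels):
    """values: real feature values; labels: matching class labels.
    Returns (threshold, information gain) for the best question (x > t?)."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate threshold
        left  = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        cond = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# e.g. best_threshold([60, 62, 65, 70, 72], ['-', '-', '+', '+', '+'])
# picks a threshold of 63.5 with a gain of about 0.97 bits
```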

What does the feature space look like? Axis-parallel cuts 42 / 45

Trees as Rules
Each path from the root to a leaf corresponds to a rule: the antecedent is the conjunction of all decisions leading to the leaf, and the consequent is the classification at the leaf node.
For example, from the tree in the color/shape/size example, we can generate:
IF ([color]=red) AND ([size]=big) THEN +
43 / 45

Summary
Decision trees are popular tools for data mining: easy to understand, easy to implement, easy to use, and computationally cheap.
Overfitting might happen: use pruning!
We've used decision trees for classification (predicting a categorical output from categorical or real-valued inputs).
44 / 45

What you should know Trees for classification Top-down tree construction algorithm Information Gain (includes Entropy and Conditional Entropy) Overfitting Pruning Dealing with real-valued features 45 / 45