Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer

Decision Trees. One of many applications: credit risk. [Figure: example decision tree with test nodes "Employed longer than 3 months?", "Positive credit report?", "Collateral > 2x disposable income?", "Unemployed?", "Collateral > 5x disposable income?", and "Student?", and leaves labeled Accepted / Rejected.] 2

Decision Trees Example Head tracking and face recognition. 3

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 4

Decision Trees Why? Simple to interpret. Provides a classification plus a justification for it: "Rejected, because subject has been employed less than 3 months and has collateral < 2x disposable income." Fast inference: for each instance, only one branch has to be traversed. Frequently used in image processing. Simple, efficient, scalable learning algorithm. Complex, nonlinear models. Works for classification and regression problems. 5

Classification. Input: instance x ∈ X. Instances are represented as a vector of attributes; an instance is an assignment to the attributes. Instances will also be called feature vectors: x = (x_1, …, x_m)^T. Output: class y ∈ Y, where Y is a finite set, e.g., {accepted, rejected} or {spam, not spam}. Attributes are, e.g., color or test results. The class is also referred to as the target attribute. 6

Classifier Learning. Input: training data L = {(x_1, y_1), …, (x_n, y_n)} with x_i = (x_i1, …, x_im)^T. Output: classifier f: X → Y, e.g., a decision tree: the path along the edges to a leaf provides the classification. 7

Regression. Input: instance x ∈ X, e.g., feature vector x = (x_1, …, x_m)^T. Output: continuous (real) value y ∈ R. Learning problem: training data with continuous target values, L = {(x_1, y_1), …, (x_n, y_n)}, e.g., {(x_1, 3.5), …, (x_n, 2.8)}. 8

Decision Trees. Test nodes: for discrete attributes, the node contains an attribute and the branches are labeled with its values; for continuous attributes, the node contains a comparison (e.g., x_j ≤ v) and the branches are labeled with the two outcomes of the comparison. Terminal nodes contain a value of the target attribute. [Figure: the credit-risk tree from slide 2.] 9

Decision Trees. Decision trees can be represented as sets of decision rules; each terminal node corresponds to one rule. E.g., Rejected ← positive credit report ∧ employment duration ≥ 3 months ∧ unemployed. [Figure: the credit-risk tree from slide 2.] 10

Application of Decision Trees. Recursively descend along a branch: in each test node, conduct the test and choose the appropriate branch; in a terminal node, return the stored value. [Figure: the credit-risk tree from slide 2.] 11
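To make the traversal concrete, here is a minimal Python sketch of decision-tree inference (illustrative only; the Node layout with attribute, threshold, children, and value fields is an assumption, not the slides' notation):

```python
# Minimal sketch of decision-tree inference (illustrative node layout, not from the slides).

class Node:
    def __init__(self, attribute=None, threshold=None, children=None, value=None):
        self.attribute = attribute       # index of the tested attribute (None for leaves)
        self.threshold = threshold       # comparison value for continuous attributes
        self.children = children or {}   # branch label -> child Node
        self.value = value               # class label / prediction stored in a leaf

def predict(node, x):
    """Descend from the root to a leaf and return the stored value."""
    while node.value is None:                # not a terminal node yet
        if node.threshold is not None:       # continuous test: x_j <= v ?
            branch = x[node.attribute] <= node.threshold
        else:                                # discrete test: branch on the attribute value
            branch = x[node.attribute]
        node = node.children[branch]
    return node.value

# Tiny example: a stump "x_0 <= 0.5 ? accepted : rejected"
stump = Node(attribute=0, threshold=0.5,
             children={True: Node(value="accepted"), False: Node(value="rejected")})
print(predict(stump, [0.3]))   # -> accepted
```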

Learning Decision Trees.

Loan | Credit report | Employment last 3 months | Collateral > 50% loan | Paid back in full
1 | Positive | Yes | No | Yes
2 | Positive | No | Yes | Yes
3 | Positive | No | No | No
4 | Negative | No | Yes | No
5 | Negative | Yes | No | No

Find a decision tree that predicts the correct class for the training data. Trivial way: create a tree that merely reproduces the training data. 12

Learning Decision Trees (slides 13–16). [Figure sequence: the trivial tree for the training data above is built step by step. The root tests Credit report (positive / negative); each branch then tests Employment last 3 months, and each of those branches tests Collateral > 50% loan. The leaves repeat the class labels of the matching training instances; a combination that does not occur in the data is marked '???'.]

Learning of Decision Trees. Perhaps more elegant: from the trees that are consistent with the training data, choose the smallest (as few nodes as possible). Small trees can be good because they are easier to interpret, and because there are more training instances per leaf node, so the class decision in each leaf is better substantiated. 17

Complete Search for the Smallest Tree How many functionally different decision trees are there? Assume m binary attributes and two classes. What is the complexity of a complete search for the smallest tree? 18

Complete Search for the Smallest Tree. How many functionally different decision trees are there? Assume m binary attributes and two classes. A complete tree has m layers of test nodes, i.e., 2^m − 1 test nodes, and there are 2^(2^m) assignments of classes to the leaf nodes. What is the complexity of a complete search for the smallest tree? Assignments of the m attributes (or no node at all) to the 2^m − 1 test node positions: O((m + 1)^(2^m − 1)). Assignments of class labels to the leaf nodes: O(2^(2^m)). 19

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 20

Learning of Decision Trees. Greedy algorithm that finds a small tree (instead of the smallest tree) but is polynomial in the number of attributes. Idea for the algorithm? 21

Greedy Algorithm: Top-Down Construction. ID3(L): 1. If all data in L have the same class y, then return a leaf node with class y. [Figure: ID3 called on the training sample.] 22

Greedy Algorithm: Top-Down Construction. ID3(L): 1. If all data in L have the same class y, then return a leaf node with class y. 2. Else: 1. Choose the attribute x_j that separates L into subsets L_1, …, L_k with the most homogeneous class distributions. 2. Let L_i = {(x, y) ∈ L : x_j = i}. 3. Return a test node with attribute x_j and children ID3(L_1), …, ID3(L_k). [Figure: splitting on Credit report — the full sample (2x paid back, 3x not paid back) is divided into the positive branch (2x paid back, 1x not paid back) and the negative branch (2x not paid back), and ID3 is called recursively on each subset.] 23

Greedy Algorithm: Top-Down Construction. ID3(L): 1. If all data in L have the same class y, then return a leaf node with class y. 2. Else: 1. Choose the attribute x_j that separates L into subsets L_1, …, L_k with the most homogeneous class distributions. 2. Let L_i = {(x, y) ∈ L : x_j = i}. 3. Return a test node with attribute x_j and children ID3(L_1), …, ID3(L_k). What does it mean that an attribute splits the sample into subsets with homogeneous class distributions? 24

Information. Information theory uses models that involve a sender, a channel, a receiver, and a probability distribution over messages. Information is a property of messages, measured in bits. The information in a message (rounded) is the number of bits needed to code that message in an optimal code. [Figure: a sender with distribution p(y) transmits message y to a receiver.] 25

Information. Information theory uses models that involve a sender, a channel, a receiver, and a probability distribution over messages. The information in a message y that is sent with probability p(y) is −log_2 p(y). Example: for p(0) = 1/2, p(1) = 1/4, p(2) = 1/4, the message 0 carries I(0) = −log_2 1/2 = 1 bit. 26

Information. Information theory uses models that involve a sender, a channel, a receiver, and a probability distribution over messages. The information in a message y that is sent with probability p(y) is −log_2 p(y). Example: for the same distribution, the message 1 carries I(1) = −log_2 1/4 = 2 bits. 27

Entropy. Entropy is the expected information of a message: H(y) = −∑_{v=1..k} p(y = v) log_2 p(y = v). Entropy quantifies the receiver's uncertainty regarding the message. Empirical entropy H_L(y): use the frequencies observed in the data L instead of probabilities. Example: for p(0) = 1/2, p(1) = 1/4, p(2) = 1/4, H(y) = −1/2 log_2 1/2 − 2 · 1/4 log_2 1/4 = 1.5 bits. 28
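As a small illustration (not part of the slides), the empirical entropy H_L(y) can be computed directly from class counts; the helper below is reused in the later sketches:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy H_L(y) in bits, estimated from the observed class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Message distribution from the slide: p = (1/2, 1/4, 1/4)  ->  1.5 bits
print(entropy([0, 0, 1, 2]))   # 1.5
```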

Entropy of Class Labels in Training Data. Information / uncertainty of the class labels = expected number of bits needed to send a message about the class label of an instance to a receiver.

Loan | x_1 (Credit report) | x_2 (Employment last 3 months) | x_3 (Collateral > 50% loan) | y (Paid back in full)
1 | Positive | Yes | No | Yes
2 | Positive | No | Yes | Yes
3 | Positive | No | No | No
4 | Negative | No | Yes | No
5 | Negative | Yes | No | No

H_L(y) = −2/5 log_2 2/5 − 3/5 log_2 3/5 = 0.97 bits. 29

Conditional Entropy. Information / uncertainty of the class labels under some condition on the features (training data as on slide 29). H_L(y | x_1 = negative) = −2/2 log_2 2/2 − 0/2 log_2 0/2 = 0 bits (with the convention 0 · log_2 0 = 0). 30

Conditional Entropy. Information / uncertainty of the class labels under some condition on the features (training data as on slide 29). H_L(y | x_1 = negative) = −2/2 log_2 2/2 − 0/2 log_2 0/2 = 0 bits. H_L(y | x_1 = positive) = −2/3 log_2 2/3 − 1/3 log_2 1/3 = 0.91 bits. 31

Information Gain of an Attribute. Reduction of entropy obtained by splitting the data along an attribute: G_L(x_j) = H_L(y) − ∑_{v=1..k} p_L(x_j = v) H_L(y | x_j = v). For the training data of slide 29: G_L(x_1) = H_L(y) − p_L(x_1 = positive) H_L(y | x_1 = positive) − p_L(x_1 = negative) H_L(y | x_1 = negative) = 0.97 − 3/5 · 0.91 − 2/5 · 0 = 0.42 bits. Splitting along x_1 reduces the uncertainty regarding the class label y by 0.42 bits. 32
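For illustration, a short Python sketch (assuming the entropy helper from the previous sketch) reproduces the numbers above for x_1 on the five loans:

```python
def information_gain(xs, ys):
    """G_L(x_j) = H_L(y) - sum_v p_L(x_j = v) * H_L(y | x_j = v)."""
    n = len(ys)
    gain = entropy(ys)
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

credit_report = ["pos", "pos", "pos", "neg", "neg"]
paid_back     = ["yes", "yes", "no",  "no",  "no"]
print(round(entropy(paid_back), 2))                          # 0.97 bits
print(round(information_gain(credit_report, paid_back), 2))  # 0.42 bits
```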

Information Gain Ratio. Motivation: predicting whether a student will pass an exam — how high is the information gain of the attribute "matriculation number"? Information gain favors attributes with many values, which is not necessarily an indicator of good generalization. Idea: factor the information that is contained in the attribute values themselves into the decision: H_L(x_j) = −∑_{v=1..k} p_L(x_j = v) log_2 p_L(x_j = v). 33

Information Gain Ratio. Idea: factor the information that is contained in the attribute values into the decision, H_L(x_j) = −∑_{v=1..k} p_L(x_j = v) log_2 p_L(x_j = v), and define the information gain ratio as GR_L(x_j) = G_L(x_j) / H_L(x_j). 34
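Continuing the sketch (and assuming the entropy and information_gain helpers plus the credit_report / paid_back lists from the previous sketches), the gain ratio needs only one extra line:

```python
def gain_ratio(xs, ys):
    """GR_L(x_j) = G_L(x_j) / H_L(x_j); penalises attributes with many values."""
    return information_gain(xs, ys) / entropy(xs)

print(round(gain_ratio(credit_report, paid_back), 2))   # 0.43
```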

Example: Information Gain Ratio. Which split is better? [Figure: the same sample split by attribute x_1, which takes four values 1, 2, 3, 4, and by a binary attribute x_2.] 35

Example: Information Gain Ratio. Which split is better? [Figure as on slide 35.] IG_L(x_1) = 1 − 4 · 1/4 · 0 = 1; H_L(x_1) = −4 · 1/4 log_2 1/4 = 2; GR_L(x_1) = IG_L(x_1) / H_L(x_1) = 1/2. IG_L(x_2) = 1 − 3/4 · 0.92 − 1/4 · 0 = 0.31; H_L(x_2) = −3/4 log_2 3/4 − 1/4 log_2 1/4 = 0.81; GR_L(x_2) = IG_L(x_2) / H_L(x_2) = 0.38. 36

Algorithm ID3. Preconditions: classification problems; all attributes have fixed, discrete ranges of values. Idea: recursive algorithm. Choose the attribute that conveys the highest information regarding the class label (use information gain, gain ratio, or the Gini index). Split the training data according to that attribute. Recursively call the algorithm for the branches. Along each branch, use each attribute only once; if all attributes have been used, return a leaf node. 37

Learning Decision Trees with ID3. ID3(L, X): 1. If all data in L have the same class y or X = {}, then return a leaf node with the majority class y. 2. Else: 1. For all attributes x_j ∈ X, calculate the split criterion G_L(x_j) or GR_L(x_j). 2. Choose the attribute x_j ∈ X with the highest G_L(x_j) or GR_L(x_j). 3. Let L_i = {(x, y) ∈ L : x_j = i}. 4. Return a test node with attribute x_j and children ID3(L_1, X \ {x_j}), …, ID3(L_k, X \ {x_j}). 38
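A compact Python sketch of this recursion (an illustrative implementation, not the original ID3 code; it assumes the information_gain helper from the earlier sketches and represents trees as nested dictionaries):

```python
from collections import Counter

def id3(L, attributes):
    """L: list of (x, y) pairs where x is a dict attribute -> value. Returns a nested-dict tree."""
    ys = [y for _, y in L]
    majority = Counter(ys).most_common(1)[0][0]
    if len(set(ys)) == 1 or not attributes:          # all same class, or no attributes left
        return majority
    # choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain([x[a] for x, _ in L], ys))
    tree = {"attribute": best, "children": {}}
    for v in set(x[best] for x, _ in L):
        L_v = [(x, y) for x, y in L if x[best] == v]
        tree["children"][v] = id3(L_v, [a for a in attributes if a != best])
    return tree

loans = [({"credit": "pos", "employed": "yes", "collateral": "no"},  "yes"),
         ({"credit": "pos", "employed": "no",  "collateral": "yes"}, "yes"),
         ({"credit": "pos", "employed": "no",  "collateral": "no"},  "no"),
         ({"credit": "neg", "employed": "no",  "collateral": "yes"}, "no"),
         ({"credit": "neg", "employed": "yes", "collateral": "no"},  "no")]
print(id3(loans, ["credit", "employed", "collateral"]))   # splits on credit first, as on the slides
```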

ID3: Example (slides 39–54). ID3(L, X) from slide 38 is run on the training data of slide 29 with X = {x_1, x_2, x_3}. The attribute x_1 (Credit report) has the highest information gain (0.42 bits) and becomes the root test node. The negative branch contains only loans that were not paid back and becomes a leaf; on the positive branch, ID3 is called recursively with X \ {x_1}, selects x_2 (Employment last 3 months), and finally tests x_3 (Collateral > 50% loan) on the remaining mixed branch, so that every leaf is labeled with the class of its training instances.

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 55

Continuous Attributes. ID3 constructs a branch for every value of the selected attribute; this only works for discrete attributes. Idea: choose an attribute-value pair and use the test x_j ≤ v, with G_L(x_j ≤ v) = H_L(y) − p_L(x_j ≤ v) H_L(y | x_j ≤ v) − p_L(x_j > v) H_L(y | x_j > v). Problem: there are infinitely many values v. Idea: use only the values that occur for that attribute in the training data. 56
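A hedged sketch of the threshold search for a single continuous attribute: it evaluates G_L(x_j ≤ v) only at values v that occur in the training data (assumes the entropy helper from the earlier sketches):

```python
def best_threshold(values, ys):
    """Return (v, gain) maximising G_L(x_j <= v) over thresholds v that occur in the data."""
    n = len(ys)
    best_v, best_gain = None, -1.0
    for v in sorted(set(values)):
        left  = [y for x, y in zip(values, ys) if x <= v]
        right = [y for x, y in zip(values, ys) if x > v]
        if not right:                      # x_j <= max(values) does not split anything off
            continue
        gain = entropy(ys) - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

print(best_threshold([0.2, 0.4, 0.6, 0.8], ["no", "no", "yes", "yes"]))   # (0.4, 1.0)
```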

Learning Decision Trees with C4.5. C4.5(L): 1. If all data in L have the same class y or are identical, then return a leaf node with the majority class y. 2. Else: 1. For all discrete attributes x_j ∈ X: calculate G_L(x_j). 2. For all continuous attributes x_j ∈ X and all values v that occur for x_j in L: calculate G_L(x_j ≤ v). 3. If a discrete attribute has the highest G_L(x_j): 1. Let L_i = {(x, y) ∈ L : x_j = i}. 2. Return a test node with attribute x_j and children C4.5(L_1), …, C4.5(L_k). 4. If a continuous attribute has the highest G_L(x_j ≤ v): 1. Let L_≤ = {(x, y) ∈ L : x_j ≤ v} and L_> = {(x, y) ∈ L : x_j > v}. 2. Return a test node with test x_j ≤ v and children C4.5(L_≤), C4.5(L_>). 57

C4.5: Example (slides 58–62). [Figure sequence: a two-dimensional sample over the continuous attributes x_1 and x_2. C4.5 grows the tree threshold by threshold: first the root test x_2 ≤ 0.5, then x_1 ≤ 0.5, then x_2 ≤ 0.7, and finally x_2 ≤ 1.5; each test node has the branches t (true) and f (false).]

Pruning. Leaf nodes that are supported by only a single (or very few) instances often do not provide a good classification. Pruning: remove test nodes whose leaves have fewer than τ instances and collect the instances in a new leaf node that is labeled with the majority class. The pruning parameter τ is a regularization parameter that has to be tuned (e.g., by cross validation). 63

Pruning with Threshold 5: Example (slides 64–66). [Figure sequence: the tree from the C4.5 example with test nodes x_2 ≤ 0.5, x_1 ≤ 0.5, x_2 ≤ 0.7, x_1 ≤ 0.8, and x_2 ≤ 1.5. The leaves under x_1 ≤ 0.8 and x_2 ≤ 1.5 are supported by fewer than 5 instances, so these test nodes are replaced by majority-class leaves; the pruned tree keeps only the tests x_2 ≤ 0.5, x_1 ≤ 0.5, and x_2 ≤ 0.7.]

Reduced Error Pruning. Split the training data into a training set and a pruning validation set. Construct a decision tree using the training set (for instance, with C4.5). Starting from the bottom layer, iterate over all test nodes whose children are leaf nodes. If removing the test node and replacing it with a leaf node that predicts the majority class reduces the error rate on the pruning set, then do it! 67
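A possible Python sketch of this procedure, operating on the nested-dict trees produced by the id3 sketch above (the <= comparison also prunes on ties, a common variant; attribute values unseen by the tree simply count as errors):

```python
from collections import Counter

def predict_tree(tree, x):
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attribute"]])   # None if the value was never seen
        if tree is None:
            return None
    return tree

def errors(tree, data):
    return sum(predict_tree(tree, x) != y for x, y in data)

def reduced_error_prune(tree, train, prune_set):
    """Bottom-up: replace a test node by a majority-class leaf if that does not
    increase the number of errors on the pruning set."""
    if not isinstance(tree, dict):                           # already a leaf
        return tree
    a = tree["attribute"]
    for v, child in list(tree["children"].items()):
        child_train = [(x, y) for x, y in train if x[a] == v]
        child_prune = [(x, y) for x, y in prune_set if x[a] == v]
        tree["children"][v] = reduced_error_prune(child, child_train, child_prune)
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    if errors(majority, prune_set) <= errors(tree, prune_set):
        return majority                                      # prune: the subtree becomes a leaf
    return tree
```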

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 68

Regression. Input: instance x ∈ X, e.g., feature vector x = (x_1, …, x_m)^T. Output: continuous (real) value y ∈ R. Learning problem: training data with continuous target values, L = {(x_1, y_1), …, (x_n, y_n)}, e.g., {(x_1, 3.5), …, (x_n, 2.8)}. 69

Regression. Input: instance x ∈ X, e.g., feature vector x = (x_1, …, x_m)^T. Output: continuous (real) value y ∈ R. Learning goal: low quadratic error plus a simple model: SSE = ∑_{i=1..n} (y_i − f(x_i))^2; MSE = (1/n) ∑_{i=1..n} (y_i − f(x_i))^2. Split criterion for regression? 70

Regression Trees. Variance of the target attribute on sample L: Var_L = (1/n) ∑_{i=1..n} (y_i − ȳ)^2. Variance = MSE of predicting the mean value. Splitting criterion: variance reduction of x_j ≤ v, R_L(x_j ≤ v) = Var_L − (n_≤ / n) Var_{L_≤} − (n_> / n) Var_{L_>}, where L_≤ and L_> are the subsets with x_j ≤ v and x_j > v and n_≤, n_> are their sizes. Stopping criterion: do not create a new test node if n · Var_L < τ. 71
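A short numpy sketch of this split criterion (illustrative; the 0/1 encoding of the credit report and the recovery-rate values are taken from the CART example on the next slides):

```python
import numpy as np

def variance_reduction(values, ys, v):
    """R_L(x_j <= v): variance of y minus the weighted variances of the two subsets."""
    ys = np.asarray(ys, dtype=float)
    mask = np.asarray(values, dtype=float) <= v
    n, n_le, n_gt = len(ys), mask.sum(), (~mask).sum()
    if n_le == 0 or n_gt == 0:             # split separates nothing
        return 0.0
    return ys.var() - n_le / n * ys[mask].var() - n_gt / n * ys[~mask].var()

y  = [0.8, 0.9, 0.4, 0.1, 0.2]             # recovery rates of the five loans
x1 = [1, 1, 1, 0, 0]                       # credit report: 1 = positive, 0 = negative
print(round(variance_reduction(x1, y, 0), 4))   # ~0.0726: splitting on x_1 helps a lot
```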

Learning Regression Trees with CART. CART(L): 1. If ∑_{i=1..n} (y_i − ȳ)^2 < τ, then return a leaf node with prediction ȳ. 2. Else: 1. For all discrete attributes x_j ∈ X: calculate R_L(x_j). 2. For all continuous attributes x_j ∈ X and all values v that occur for x_j in L: calculate R_L(x_j ≤ v). 3. If a discrete attribute has the highest R_L(x_j): 1. Let L_i = {(x, y) ∈ L : x_j = i}. 2. Return a test node with attribute x_j and children CART(L_1), …, CART(L_k). 4. If a continuous attribute has the highest R_L(x_j ≤ v): 1. Let L_≤ = {(x, y) ∈ L : x_j ≤ v} and L_> = {(x, y) ∈ L : x_j > v}. 2. Return a test node with test x_j ≤ v and children CART(L_≤), CART(L_>). 72

CART: Example. Which split is better?

Loan | x_1 (Credit report) | x_2 (Employment last 3 months) | y (recovery rate)
1 | Positive | Yes | 0.8
2 | Positive | No | 0.9
3 | Positive | No | 0.4
4 | Negative | No | 0.1
5 | Negative | Yes | 0.2

Var_L = 1/5 [(0.8 − 0.48)^2 + … + (0.2 − 0.48)^2] = 0.1016. Split on x_1: Var_pos = 1/3 [(0.8 − 0.7)^2 + (0.9 − 0.7)^2 + (0.4 − 0.7)^2] = 0.0467 and Var_neg = 1/2 [(0.1 − 0.15)^2 + (0.2 − 0.15)^2] = 0.0025, so R_L(x_1) = 0.1016 − 3/5 · 0.0467 − 2/5 · 0.0025 = 0.0726. Split on x_2: Var_yes = 1/2 [(0.8 − 0.5)^2 + (0.2 − 0.5)^2] = 0.09 and Var_no = 1/3 [(0.9 − 0.47)^2 + (0.4 − 0.47)^2 + (0.1 − 0.47)^2] = 0.1089, so R_L(x_2) = 0.1016 − 2/5 · 0.09 − 3/5 · 0.1089 = 0.0003. Splitting on x_1 is clearly better. 73

Model Trees. Nodes in regression trees are labeled with the mean target value of their training instances. Idea: instead, train a local linear regression model on the instances that fall into each leaf. 74

Linear Regression. Input: L = {(x_1, y_1), …, (x_n, y_n)}. Linear regression model: f(x) = θ^T x + θ_0. Will be discussed in a later lecture. [Figure: regression plane with its normal vector and the residuals; the regression plane is perpendicular to the normal vector.] 75

Model Trees (slides 76–77). Decision trees, but with a linear regression model at each leaf node. [Figure: a one-dimensional sample over x; a model tree with test nodes x < 0.9, x < 0.5, and x < 1.1 (branches yes / no) fits a separate local linear model to the instances that fall into each leaf.]

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 78

Ensemble Learning. Assume that we have 3 models f_i(x), each with an error probability of 0.1. Let f(x) be the majority vote among the f_i(x): f(x) = arg max_y |{i : f_i(x) = y}|. What is the error probability of f(x)? 79

Ensemble Learning. Assume that we have 3 models f_i(x), each with an error probability of 0.1. Let f(x) be the majority vote among the f_i(x): f(x) = arg max_y |{i : f_i(x) = y}|. What is the error probability of f(x)? It depends on how correlated the models are: if they always make identical predictions, the error rate of the ensemble is 0.1 as well. 80

Ensemble Learning. Assume that we have 3 models f_i(x), each with an error probability of 0.1. Let f(x) be the majority vote among the f_i(x): f(x) = arg max_y |{i : f_i(x) = y}|. What is the error probability of f(x) if the models are independent? At least 2 out of 3 models have to err, and there are 3 subsets of 2 models, so the error probability is approximately 3 · 0.1^2 = 0.03. The ensemble is much better than the individual models. 81
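The back-of-the-envelope calculation can be checked exactly with a small Python sketch (assuming independent errors):

```python
from math import comb

def majority_error(k, p):
    """P(the majority of k independent models errs), each model erring with probability p."""
    need = k // 2 + 1                       # number of wrong models needed to flip the vote
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

print(round(majority_error(3, 0.1), 3))     # 0.028 (slide's approximation: 3 * 0.1^2 = 0.03)
print(round(majority_error(5, 0.1), 3))     # 0.009 -- more independent models help further
```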

Ensemble Learning. Assume that we have n models f_i(x), each with an error probability of 0.5 − ε. Let f(x) be the majority vote among the f_i(x): f(x) = arg max_y |{i : f_i(x) = y}|. The ensemble f(x) may be much better than the individual models if each model's error probability is below 0.5 and the models are sufficiently uncorrelated. 82

Ensemble Learning Ensemble methods differ in how they try to obtain uncorrelated classifiers from given training data. Bagging: sample random subsets of training instances. Random forests: sample random subsets of training instances and random subsets of features. Boosting: iteratively sample subsets of the training instances such that instances that are misclassified by the current ensemble receive a higher weight. 83

Bootstrapping. The bootstrapping procedure draws n instances from a set of n instances with replacement; instances can be drawn multiple times. 1. Input: sample L of size n. 2. For i = 1 … k: 1. Draw n instances uniformly with replacement from L into set L_i. 2. Learn model f_i on sample L_i. 3. Return models (f_1, …, f_k). 84

Bagging = Bootstrap Aggregating. Input: sample L of size n. 1. For i = 1 … k: 1. Draw n instances uniformly with replacement from L into set L_i. 2. Learn model f_i on sample L_i. 2. For classification: let f(x) be the majority vote among (f_1(x), …, f_k(x)). 3. For regression: let f(x) = (1/k) ∑_{i=1..k} f_i(x). 85
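A minimal bagging sketch (illustrative; it assumes numpy arrays X, y and uses scikit-learn's DecisionTreeClassifier as the base learner, which is not part of the slides):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    """Train k trees, each on a bootstrap sample of n instances drawn with replacement."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)            # bootstrap sample L_i
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Classification: majority vote over the individual trees."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```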

Bagging Bootstrap Aggregating Models are somewhat uncorrelated because they have been trained on different subsets of the data. But all models use the same attributes and share a good part of the training data. Idea: also vary the attributes. 86

Random Forests. Input: sample L of size n, attributes X. 1. For i = 1 … k: 1. Draw n instances uniformly with replacement from L into set L_i. 2. Draw m attributes from all attributes X into X_i. 3. Learn model f_i on sample L_i using attributes X_i. 2. For classification: let f(x) be the majority vote among (f_1(x), …, f_k(x)). 3. For regression: let f(x) = (1/k) ∑_{i=1..k} f_i(x). 87

Random Forests. The number of sampled attributes is a hyperparameter. The decision trees are sometimes trained with a depth bound and are usually trained without pruning. The number of trees is a hyperparameter; often, large, fixed values are used (e.g., 1000 trees). 88
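In practice one would usually rely on a library implementation; a hedged usage sketch with scikit-learn is shown below (note that scikit-learn resamples the attributes at every split rather than once per tree, and the parameter values are illustrative, not prescribed by the slides):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Many unpruned trees; max_features controls how many attributes are sampled per split.
forest = RandomForestClassifier(n_estimators=1000, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())   # accuracy around 0.96 on this toy data set
```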

Overview Decision trees. Classification with categorical features: ID3. Classification with continuous features: C4.5. Regression: CART, model trees. Bagging. Random forests. Example: Viola-Jones algorithm for face detection. 89

Face Detection: Training Data. Rectangular boxes that contain aligned images of faces (positive examples) and non-faces (negative examples) in a fixed resolution. A typical resolution is 24 x 24 pixels. 90

Haar Features. Subtract the sum of the pixels in the white rectangle from the sum of the pixels in the black rectangle. Size, type, position, and polarity are parameters. Features A and C detect horizontal lines, B vertical lines, and D diagonal patterns. 91

Haar Features. Subtract the sum of the pixels in the white rectangle from the sum of the pixels in the black rectangle. Size, type, position, and polarity are parameters. Features A and C detect horizontal lines, B vertical lines, and D diagonal patterns. Feature value = ∑_{pixels i in green rectangle} grey value(i) − ∑_{pixels i in red rectangle} grey value(i). 92

Haar Features: Efficient Calculation. Integral image: I_int(x, y) = ∑_{x′ ≤ x, y′ ≤ y} I(x′, y′). It can be computed in one pass over the image, and all Haar features can then be calculated in constant time from the integral image. Example (2 x 2 grid of cells A, B, C, D at positions [1,1], [1,2], [2,1], [2,2]): B − A = I_int[1,2] − 2 I_int[1,1]; A − C = I_int[1,1] − (I_int[2,1] − I_int[1,1]). 93
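A small numpy sketch of the integral image and a constant-time rectangle sum (illustrative; the inclusive coordinate convention is an assumption, not the slides'):

```python
import numpy as np

def integral_image(img):
    """I_int(x, y) = sum of all pixels with x' <= x and y' <= y (one pass via cumulative sums)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from at most four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
# A two-rectangle Haar feature is then just rect_sum(white region) - rect_sum(black region).
```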

Decision Stumps with Haar Features. Decision stumps = decision trees of depth 1. 1. Input: training data L. 2. For all available features (type, size, position, polarity, threshold): 1. Calculate the split criterion (reduction of the error rate). 3. Create a test node with the best Haar feature. 94

Ensemble of Decision Stumps. Use AdaBoost (a variant of boosting) to create an ensemble of decision stumps of a prescribed size. The ensemble is a weighted combination of features that yields a continuous value, which is compared against a threshold. Lowering this threshold increases the recognition rate and the false-positive rate; increasing it decreases both. 95

Training a Recognition Cascade. Input: training data L, number of layers n, desired recall for each layer r[l], ensemble size for each layer s[l]. For layer l = 1 … n: train an ensemble (weighted combination of features) f[l] of decision stumps of size s[l]; adjust the threshold θ[l] such that a recall of r[l] is obtained; continue training on the next layer with those instances that are classified as a face by the ensemble f[l]. 96

Face Detection Cascade. The first layer is an ensemble of two Haar features that has a recall of 100% and removes 50% of the non-faces. Subsequent layers have larger ensembles. [Figure: a window passes through the layer tests "if f_1 ≥ θ[1]", "if f_2 ≥ θ[2]", …, "if f_n ≥ θ[n]"; whenever a test fails, the window is immediately classified as not a face.] 97
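The early-rejection logic of the cascade fits in a few lines (a sketch under assumptions: ensembles and thresholds stand for the learned f[l] and θ[l], and each layer is represented as a list of (weight, stump) pairs where a stump is a callable on the image window):

```python
def cascade_classify(window, ensembles, thresholds):
    """Return True iff every layer's weighted score reaches its threshold theta[l]."""
    for f_l, theta_l in zip(ensembles, thresholds):
        score = sum(weight * stump(window) for weight, stump in f_l)
        if score < theta_l:
            return False        # rejected early: most non-face windows never reach later layers
    return True
```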

Scanning Images for Faces. 1. For all sizes of the bounding box: 1. For all positions of the bounding box in the image: 1. Resample the image window under the bounding box to the fixed resolution of the cascade (24 x 24 pixels). 2. Apply the face detection cascade. 2. Return all bounding boxes that were classified as a face by the cascade. 98

Shape Regression with LBF Features. A random forest is applied iteratively to predict position updates that move a set of facial landmarks to their correct positions. 99

Summary of Decision Trees. Classification with discrete attributes: ID3; with continuous attributes: C4.5. Regression: CART (mean value in each leaf); model trees (linear regression model in each leaf). Decision trees are comprehensible models; the tests along a path provide the reason for each classification result. Applying a decision tree is fast; decision trees are popular in image processing tasks (face detection, face alignment). 100