Bias Correction in Classification Tree Construction ICML 2001


Bias Correction in Classification Tree Construction
ICML 2001
Alin Dobra, Johannes Gehrke
Department of Computer Science, Cornell University
December 15, 2001

Classification Tree Construction

Training data (the PlayTennis example):

Outlook    Temp.  Humidity  Wind    Play Tennis?
Sunny      Hot    High      Weak    No
Sunny      Hot    High      Strong  No
Overcast   Hot    High      Weak    Yes
Rain       Mild   High      Weak    Yes
Rain       Cool   Normal    Weak    Yes
Rain       Cool   Normal    Strong  No
Overcast   Cool   Normal    Strong  Yes
Sunny      Mild   High      Weak    No
Sunny      Cool   Normal    Weak    Yes
Rain       Mild   Normal    Weak    Yes
Sunny      Mild   Normal    Strong  Yes
Overcast   Mild   High      Strong  Yes
Overcast   Hot    Normal    Weak    Yes
Rain       Mild   High      Strong  No

[Figure: the resulting classification tree, with Outlook at the root; the Overcast branch is a Yes leaf and the Rain branch tests Wind (Strong: No, Weak: Yes).]
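The same kind of construction can be reproduced with an off-the-shelf learner. Below is a minimal sketch using scikit-learn and pandas (an assumption of this note, not tools mentioned in the talk); scikit-learn grows binary trees over one-hot encoded attributes, so the printed tree will not match the multiway tree on the slide exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The PlayTennis table from the slide.
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical predictors and fit a gini-based tree.
X = pd.get_dummies(data.drop(columns="Play"))
tree = DecisionTreeClassifier(criterion="gini").fit(X, data["Play"])
print(export_text(tree, feature_names=list(X.columns)))
```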

Motivating Examples

Experiment 1:
- Two class labels, equally probable: 1 and 2.
- Two predictor variables: X_1 with 10 possible values and X_2 with 2 possible values.
- Both variables are uncorrelated with the class label.
- Random datasets of N data-points.
- Split criterion: gini gain (Breiman 1984):

$$\Delta g \;\overset{\text{def}}{=}\; \sum_{i=1}^{n} P[X=x_i] \sum_{j=1}^{k} P[C=c_j \mid X=x_i]^2 \;-\; \sum_{j=1}^{k} P[C=c_j]^2$$

Result: X_1 is chosen 8 times more often than X_2.
Conclusion: the gini gain is biased towards predictor variables with more categories.
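A small Python sketch of this experiment, assuming numpy; the domain sizes, dataset size, and trial count below are illustrative stand-ins rather than the exact values used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini_gain(x, c):
    """Gini gain of predictor x for class labels c (both 1-D integer arrays)."""
    n = len(c)
    _, class_counts = np.unique(c, return_counts=True)
    gain = -np.sum((class_counts / n) ** 2)                   # - sum_j P[C=c_j]^2
    for v in np.unique(x):
        mask = x == v
        _, cond_counts = np.unique(c[mask], return_counts=True)
        cond_term = np.sum((cond_counts / mask.sum()) ** 2)   # sum_j P[C=c_j | X=x_i]^2
        gain += mask.mean() * cond_term                       # weighted by P[X=x_i]
    return gain

# Class label independent of both predictors; X1 has more categories than X2.
N, trials = 200, 2000
wins_x1 = 0
for _ in range(trials):
    c  = rng.integers(0, 2, size=N)    # two equally likely class labels
    x1 = rng.integers(0, 10, size=N)   # many-valued predictor
    x2 = rng.integers(0, 2, size=N)    # two-valued predictor
    wins_x1 += gini_gain(x1, c) > gini_gain(x2, c)
print(f"X1 preferred in {wins_x1}/{trials} trials")  # far more than half, despite independence
```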

Motivating Examples (cont.)

Experiment 2:
- Same setup as before, but X_2 is slightly correlated with the class label: P[C = 1 | X_2 = x_{21}] = 0.51.
- Experiments for two training-set sizes N.

Result: the odds of choosing X_1 over X_2 were 62:1 and 32:1 for the two values of N.
Conclusion: the gini gain chooses the wrong predictor variable to split on with high probability when only weak correlations are present.

Outline of the Talk
- Motivation
- Bias in split variable selection
  - Formal definition of the bias
  - Experimental demonstration of the bias
- Correction of the bias
  - General method for bias removal
  - Correction of the bias of gini gain
  - Explanation of the bias of gini gain
  - Experimental evaluation
- Summary and future work

Bias in Split Selection

How do we define this bias formally?

Null hypothesis (H_0): class labels are independent of the predictor variables and are generated by pure coin flips with a multi-face coin with probabilities p_1, ..., p_k.

Notation:
- D: random dataset distributed according to H_0.
- s(D, X): value of split criterion s applied to dataset D and predictor variable X.

Definition:
$$\mathrm{Bias}(X_1, X_2) = \log_{10}\left(\frac{P[s(D, X_1) > s(D, X_2)]}{1 - P[s(D, X_1) > s(D, X_2)]}\right)$$
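As a worked illustration of the definition (with illustrative numbers, not values from the talk): if X_1 produces the larger criterion value on 80% of random datasets drawn under H_0, then Bias(X_1, X_2) = log_10(0.8 / 0.2) ≈ 0.6, whereas an unbiased criterion gives P = 1/2 and hence a bias of 0; the sign of the bias indicates which variable is favoured.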

Experimental Setup
- Synthetic datasets generated according to H_0.
- Two predictor variables: X_1 with domain size n_1 = 10 and X_2 with domain size n_2 = 2.
- Dataset size N varied over a range.
- p_1 varied between 0 and 1/2.
- Monte Carlo trials used to estimate P[s(D, X_1) > s(D, X_2)].
- Exactly the same dataset instances used for all split criteria.
- A fair coin used to break ties.
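A sketch of how this bias measure could be estimated under the setup above; the helper name estimate_bias and all parameter values are illustrative, and gini_gain refers to the function defined in the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_bias(split_criterion, n1, n2, N, p1, trials=2000):
    """Monte Carlo estimate of Bias(X1, X2) = log10(P / (1 - P)) with
    P = P[s(D, X1) > s(D, X2)], where D is drawn under H0 (class labels
    independent of both predictors) and ties are broken with a fair coin."""
    wins = 0
    for _ in range(trials):
        c  = rng.choice(2, size=N, p=[p1, 1.0 - p1])  # class label, independent of X1 and X2
        x1 = rng.integers(0, n1, size=N)
        x2 = rng.integers(0, n2, size=N)
        s1, s2 = split_criterion(x1, c), split_criterion(x2, c)
        if s1 > s2 or (s1 == s2 and rng.random() < 0.5):
            wins += 1
    p = wins / trials                       # needs enough trials so that 0 < p < 1
    return np.log10(p / (1.0 - p))

# Example: estimate_bias(gini_gain, n1=10, n2=2, N=200, p1=0.5)
```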

Experimental Bias of Gini Gain

$$\Delta g \;\overset{\text{def}}{=}\; \sum_{i=1}^{n} P[X=x_i] \sum_{j=1}^{k} P[C=c_j \mid X=x_i]^2 \;-\; \sum_{j=1}^{k} P[C=c_j]^2$$

[Figure: estimated bias of the gini gain as a function of N and log10(p1).]

Experimental Bias of Information Gain

$$IG \;\overset{\text{def}}{=}\; \sum_{j=1}^{k} \Phi(P[C=c_j]) + \sum_{i=1}^{n} \Phi(P[X=x_i]) - \sum_{j=1}^{k}\sum_{i=1}^{n} \Phi(P[C=c_j, X=x_i]), \qquad \Phi(p) = -p\log p$$

[Figure: estimated bias of the information gain as a function of N and log10(p1).]

Experimental Bias of Gain Ratio

$$GR \;\overset{\text{def}}{=}\; \frac{IG}{\sum_{i=1}^{n}\Phi(P[X=x_i])}$$

[Figure: estimated bias of the gain ratio as a function of N and log10(p1).]

Experimental Bias of the χ²-Test

$$\chi^2 \;\overset{\text{def}}{=}\; \sum_{i=1}^{n}\sum_{j=1}^{k} \frac{(A_{ij} - E(A_{ij}))^2}{E(A_{ij})}, \qquad E(A_{ij}) = \frac{N_i\, s_j}{N}$$

where A_{ij} are the contingency-table counts and N_i and s_j are the corresponding row and column totals.

[Figure: estimated bias of the χ²-test as a function of N and log10(p1).]
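A compact sketch of the three criteria defined on the preceding slides, computed from an n-by-k contingency table of counts; it assumes numpy, and the χ² part could equally be obtained from scipy.stats.chi2_contingency.

```python
import numpy as np

def phi(p):
    """Phi(p) = -p log p (natural log; any fixed base works), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = p > 0
    out[mask] = -p[mask] * np.log(p[mask])
    return out

def criteria_from_table(A):
    """Information gain, gain ratio, and chi^2 for an n-by-k table of counts A[i, j]."""
    A = np.asarray(A, dtype=float)
    N = A.sum()
    p_x, p_c = A.sum(axis=1) / N, A.sum(axis=0) / N          # P[X = x_i], P[C = c_j]
    ig = phi(p_c).sum() + phi(p_x).sum() - phi(A / N).sum()  # H(C) + H(X) - H(X, C)
    gr = ig / phi(p_x).sum()                                 # gain ratio
    expected = np.outer(A.sum(axis=1), A.sum(axis=0)) / N    # E(A_ij) = N_i * s_j / N
    chi2 = ((A - expected) ** 2 / expected).sum()
    return ig, gr, chi2

print(criteria_from_table([[30, 10], [10, 30]]))  # a small 2x2 example
```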

Outline of the Talk
- Motivation
- Bias in split variable selection
  - Formal definition of the bias
  - Experimental demonstration of the bias
- Correction of the bias
  - General method for bias removal
  - Correction of the bias of gini gain
  - Explanation of the bias of gini gain
  - Experimental evaluation
- Summary and future work

General Method for Bias Removal

Definition: the p-value of an observation x is the probability that, under H_0, a value at least as large as x is observed. A small p-value is evidence against H_0.

Lemma: the p-value of any split criterion is unbiased if no single value of the criterion has significant probability under H_0, irrespective of the tie-breaking strategy used.

Computation of the p-value of a Split Criterion
- Exact computation: very expensive; efficient only for n = 2 and k = 2 (Martin 1997).
- Bootstrapping (Monte Carlo simulation) (Frank & Witten 1998): expensive for small p-values.
- Asymptotic approximations, e.g. the χ²-distribution approximation of the distribution of the χ²-test (Kass 1980): inaccurate in border conditions such as small entries in the contingency table.
- Tight approximations: approximate the distribution of the criterion with a parametric distribution such that the approximation is accurate everywhere.
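A sketch of the bootstrapping / Monte Carlo option, here implemented as a permutation of the class labels (one standard way to simulate H_0); the function name and trial count are illustrative, and as the slide notes this approach becomes expensive when very small p-values must be resolved.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_p_value(split_criterion, x, c, trials=1000):
    """Monte Carlo p-value under H0: permute the class labels (which simulates
    independence from x) and count how often the criterion value is at least
    as large as the one actually observed."""
    observed = split_criterion(x, c)
    hits = sum(split_criterion(x, rng.permutation(c)) >= observed for _ in range(trials))
    return (hits + 1) / (trials + 1)   # keeps the estimate strictly positive

# Example: mc_p_value(gini_gain, x1, c) with gini_gain and data from the earlier sketches.
```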

Tight Approximation of the Gini Gain P-value

Can show theoretically:
$$E(\Delta g) = \frac{n-1}{N}\left(1 - \sum_{j=1}^{k} p_j^2\right), \qquad \mathrm{Var}(\Delta g) \approx \frac{n-1}{N^2}\, f(p_1,\dots,p_k)$$

Experimental observation: the distribution of the gini gain is well approximated by a Gamma distribution (with cumulative distribution function Q).

New split criterion:
$$\text{p-value}(\Delta g_e) = 1 - Q\!\left(\frac{E(\Delta g)^2}{\mathrm{Var}(\Delta g)},\; \frac{\Delta g_e\, E(\Delta g)}{\mathrm{Var}(\Delta g)}\right)$$

where the first argument of Q is the shape parameter of the fitted Gamma distribution and the second is the observed gain Δg_e divided by the scale parameter Var(Δg)/E(Δg).
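A minimal sketch of this Gamma approximation using scipy.stats.gamma; the function name is illustrative, and the mean and variance are taken as inputs since the slide gives the closed-form mean but leaves the variance function f unspecified here.

```python
from scipy.stats import gamma

def gini_gain_p_value(observed_gain, mean, var):
    """p-value of an observed gini gain under the Gamma approximation:
    fit a Gamma distribution by matching the first two moments
    (shape = mean^2 / var, scale = var / mean) and take the upper tail."""
    shape, scale = mean ** 2 / var, var / mean
    return 1.0 - gamma.cdf(observed_gain, a=shape, scale=scale)

# mean and var would come from the closed-form expressions above, e.g.
# mean = (n - 1) * (1 - sum(p**2 for p in class_probs)) / N
```

The corrected procedure would then split on the predictor whose observed gain has the smallest such p-value.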

Experimental Bias of Gini Gain

[Figure: estimated bias of the gini gain as a function of N and log10(p1).]

Explanation of the Bias of Gini Gain

[Figure: densities of the gini gain under H_0 for the two predictor variables (n_1 = 10 and n_2 = 2); x-axis: gini index, y-axis: density.]

Experimental Evaluation of the P-value of the Gini Index

[Figure: estimated bias of the gini gain p-value criterion as a function of N and log10(p1).]

Summary and Future Work
- Defined bias as a log-odds ratio and showed experiments for different split criteria.
- General method for bias removal: use the p-value of an existing criterion.
- Corrected the bias of the gini gain and showed experimentally that the correction is successful.
- Future work:
  - Analyze experimentally and theoretically the correction for the correlated case.
  - Find a correction for CART-style binary splits.