Knowledge Discovery and Data Mining

Lecture 06: Regression & Decision Trees
Tom Kelsey
School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk
13 Feb 2015

Some preliminary jargon
- Training data: the cases used to construct the tree
- Validation data: held-out cases, not used to construct the tree
- Resubstitution error: the error rate of a tree on the cases from which it was constructed
- Generalisation error: the error rate of a tree on unseen cases
- Purity: how homogeneous (i.e. how little mixed-up) the training data is at a node

Tree Generation
All tree methods follow this basic scheme:
- We have a set of mixed-up data, so we immediately need some measure of how mixed-up the data is.
- Find the covariate-value pair $(x_j, s_j)$ that produces the most separation in the data $X, y$.
- Split the data into two subsets: the rows in which $x_j < s_j$, and the other rows, in which $x_j \ge s_j$. The data in each subset is less mixed-up, and each split forms the root of a new subtree.
- Recurse by repeating for each subtree.

Tree Generation
The methods differ in three choices:
- How "mixed-up" is measured: we need to measure randomness or, equivalently, the level of node purity.
- How we decide when to stop splitting: the heuristics for this are common to all methods, and are often as simple as "stop when 4 or fewer items are in the subset".
- How we condense the instances that fall into each split, i.e. what the actual output (prediction) is at a terminal node.
A minimal sketch of the whole scheme is given below.
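To make the scheme concrete, here is a minimal Python sketch of greedy recursive binary splitting. It is not the rpart/CART implementation used later in the course; the impurity measure, the leaf summary and a hypothetical min_size stopping rule are passed in as parameters, matching the three choices listed above.

```python
# Minimal sketch of greedy recursive binary splitting.
# X: list of feature vectors (lists of numbers); y: list of responses at the node.
# impurity, leaf_value and min_size are the three method-specific choices.

def best_split(X, y, impurity):
    """Return (gain, feature index, threshold) with the largest impurity decrease."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        for s in sorted(set(row[j] for row in X)):
            left = [y[i] for i in range(n) if X[i][j] < s]
            right = [y[i] for i in range(n) if X[i][j] >= s]
            if not left or not right:
                continue
            gain = (impurity(y)
                    - len(left) / n * impurity(left)
                    - len(right) / n * impurity(right))
            if best is None or gain > best[0]:
                best = (gain, j, s)
    return best

def grow_tree(X, y, impurity, leaf_value, min_size=4):
    """Split recursively until a node is small or no split reduces impurity."""
    if len(y) <= min_size:
        return {"leaf": leaf_value(y)}
    split = best_split(X, y, impurity)
    if split is None or split[0] <= 0:
        return {"leaf": leaf_value(y)}
    _, j, s = split
    left_idx = [i for i in range(len(y)) if X[i][j] < s]
    right_idx = [i for i in range(len(y)) if X[i][j] >= s]
    return {"feature": j, "threshold": s,
            "left": grow_tree([X[i] for i in left_idx], [y[i] for i in left_idx],
                              impurity, leaf_value, min_size),
            "right": grow_tree([X[i] for i in right_idx], [y[i] for i in right_idx],
                               impurity, leaf_value, min_size)}
```

Plugging in variance and the mean gives a regression tree; a classification tree instead uses an impurity of the class labels at the node (entropy or the Gini index of their proportions, defined later) and predicts the majority class.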

Regression Trees
Classical statistical approach:
- Mixed-up is measured by the standard deviation (or any other measure of variability).
- In ANOVA terms, find nodes with minimal within-node variance and hence maximal between-node variance.
- Condense using the average of the instances that fall into each split (i.e. predict with the mean).
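Assuming the grow_tree sketch above, these choices amount to passing in variance as the impurity and the mean as the leaf value; the small dataset below is made up purely for illustration.

```python
# Regression-tree choices for the grow_tree sketch above:
# impurity = variance of the responses at a node, leaf value = their mean.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)

def mean(ys):
    return sum(ys) / len(ys)

# Hypothetical data: one covariate, response roughly a step function of it.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
tree = grow_tree(X, y, impurity=variance, leaf_value=mean, min_size=3)
print(tree)  # one split at x >= 10.0, with leaves predicting roughly 1.0 and 5.0
```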

Decision Trees
Information Theory approach:
- Mixed-up is measured by the amount of (im)purity.
- Condense via the majority class (predict the most common class).
Purity can be measured in many ways: entropy, the Gini index and the twoing rule.
- Gini can produce pure but small nodes; the twoing rule is a tradeoff between purity and equality of the amount of data on either side of the split.
- Twoing isn't part of the R package rpart that implements CART, so we don't consider it in detail.
We also need pruning criteria, since the plan is always to (a) construct a big (overfitting) tree and then (b) reduce tree complexity to get a good tradeoff between resubstitution error and generalisation error.
Before we look at tree construction we need to learn more about randomness, and about the differences between categorical and numeric data.

Purity
- A table or subtable is pure if it contains only one class.
- In regression tree terminology, the SD of the outputs is zero.
- In classification tree terminology, the node contains cases of a single class, so its impurity is zero.
- We split and resplit in order to increase node purity.
- Complete tree purity is analogous to overfitting.

Standard Deviations [figure]

Types of correct/incorrect classification
When considering a two-class problem, the range of prediction outcomes is small:
- false positive or false negative
- correct negative or correct positive
For a multi-class problem this range of outcomes is considerably larger:
- incorrectly predicting the class as $j$ when the correct class is $i$ (for $i, j \in \{1, \dots, J\}$, $i \ne j$)
- correctly predicting the class as $j$ ($j \in \{1, \dots, J\}$)
So we have $J$ types of correct classification, and $J^2 - J$ types of incorrect classification.
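A quick check of the counting argument, taking a hypothetical $J = 3$:

```python
# Enumerate the distinct (true class, predicted class) outcomes for J classes.
J = 3
correct = [(i, i) for i in range(1, J + 1)]
incorrect = [(i, j) for i in range(1, J + 1) for j in range(1, J + 1) if i != j]
print(len(correct), len(incorrect))  # 3 and 6, i.e. J and J**2 - J
```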

Purity of nodes
Intuitively, we want to optimise some measure of the purity of nodes.
Consider a $J$-level categorical response variable. A node gives a vector of proportions $p = [p_1, \dots, p_J]$ for the levels of our response.
Since $\sum_{j=1}^{J} p_j = 1$, the vector $p$ is a probability distribution of the response classes within the node.

Node purity
We can list some desirable/necessary properties of an impurity measure, which will be a function $\varphi(p)$ of these proportions:
- $\varphi(p)$ will be a maximum when $p = [\frac{1}{J}, \dots, \frac{1}{J}]$. This is our definition of the least pure node, i.e. there is an equal mixture of all classes.
- $\varphi(p)$ will be a minimum when some $p_j = 1$ (and therefore all the others are zero). This is our definition of the most pure node: only one class exists.

Node purity
Our measure of the impurity of a node $t$ will be given by
$i(t) = \varphi(p(1 \mid t), \dots, p(J \mid t))$
A measure of the decrease of impurity resulting from splitting node $t$ into a left node $t_L$ and a right node $t_R$ will be given by
$\delta i(t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)$
where $p_L$ and $p_R$ are the proportions of points in $t$ that go to the left and right respectively.
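A minimal sketch of the impurity-decrease calculation; the impurity function (entropy or the Gini index, defined on the following slides) is assumed to act on a vector of class proportions.

```python
from collections import Counter

def proportions(labels):
    """Class proportions p = [p_1, ..., p_J] of the cases at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def impurity_decrease(parent, left, right, impurity):
    """delta_i(t) = i(t) - p_L * i(t_L) - p_R * i(t_R), for lists of class labels."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return (impurity(proportions(parent))
            - p_left * impurity(proportions(left))
            - p_right * impurity(proportions(right)))
```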

Logarithms: refresher
- $b^x = y$ if and only if $\log_b y = x$. To be precise one needs to distinguish special cases; for example $y$ cannot be 0, and the log of a negative number is complex (not needed here).
- Inverse of exponentiation: $\log_b b^x = x$ and $b^{\log_b x} = x$.
- Base change: $\log_a x = \frac{\log_b x}{\log_b a}$. (This is why you don't need a $\log_2$ button on your calculator.)
- Useful identity: $\log_b x^{-1} = -\log_b x$.
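The base-change rule is why any tool offering a natural or base-10 logarithm can compute $\log_2$; a quick check in Python:

```python
import math

x = 0.4
print(math.log(x) / math.log(2))  # log2 via the base-change rule: -1.3219...
print(math.log2(x))               # the built-in base-2 log gives the same value
```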

Entropy Definition
We work in base 2, taking the bit as our unit. We can now precisely define the entropy of a set of output class proportions as:
$H(p_1, \dots, p_n) = \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_i p_i \log_2 p_i$
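A direct transcription of the definition as a Python sketch; the class proportions are assumed to be supplied, and zero proportions are skipped (using the convention $0 \log 0 = 0$):

```python
import math

def entropy(probs):
    """H(p_1, ..., p_n) = -sum_i p_i * log2(p_i), measured in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)
```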

Example
Class        Bus   Car   Train
Probability  0.4   0.3   0.3
$H(0.4, 0.3, 0.3) = -0.4 \log_2 0.4 - 0.3 \log_2 0.3 - 0.3 \log_2 0.3 \approx 1.571$
We say our output class data has an entropy of about 1.57 bits.

Example
Class        Bus   Car   Train
Probability  0     1     0
$H(0, 1, 0) = -1 \log_2 1 = 0$
We say our output class data has zero entropy, meaning zero randomness.

Example
Class        Bus   Car   Train
Probability  1/3   1/3   1/3
$H(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}) = -\tfrac{1}{3} \log_2 \tfrac{1}{3} - \tfrac{1}{3} \log_2 \tfrac{1}{3} - \tfrac{1}{3} \log_2 \tfrac{1}{3} \approx 0.528321 + 0.528321 + 0.528321 \approx 1.584963$
We say our output class data has the maximum entropy for a response with this number of classes, meaning the most randomness.
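Using the entropy sketch above, the three worked examples can be checked numerically:

```python
print(entropy([0.4, 0.3, 0.3]))      # ~1.571 bits
print(entropy([0, 1, 0]))            # 0.0 bits: no randomness
print(entropy([1/3, 1/3, 1/3]))      # ~1.585 bits = log2(3), the maximum for 3 classes
```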

Gini Index
One minus the sum of the squared output probabilities:
$1 - \sum_j p_j^2$
In our example, $1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660$.
- The minimum Gini index is zero.
- The maximum Gini index is $1 - n (\tfrac{1}{n})^2 = 1 - \tfrac{1}{n}$: two thirds in our example.
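The corresponding Python sketch, in the same style as the entropy function above:

```python
def gini(probs):
    """Gini index 1 - sum_j p_j^2 for a vector of class proportions."""
    return 1 - sum(p * p for p in probs)

print(gini([0.4, 0.3, 0.3]))    # ~0.66, as in the worked example
print(gini([0, 1, 0]))          # 0.0, a pure node
print(gini([1/3, 1/3, 1/3]))    # ~0.667 = 1 - 1/3, the maximum for 3 classes
```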

Entropy and Gini compared
[Figure: node impurity measures versus class proportion for a 2-class problem]

Misclassification error rate
Defined as the number of incorrect classifications divided by the total number of classifications.
Hence, using the terminology from Lecture 4, it is equal to 1 minus the accuracy of a classification predictor:
$\mathrm{MER} = 1 - \frac{a + d}{a + b + c + d}$
In the context of analysing nodes, this is 1 minus the maximum proportion in $p = [p_1, \dots, p_J]$:
$\mathrm{MER} = 1 - \max_j (p_j)$
From the chart, entropy and Gini capture more of the notion of node impurity, and so are the preferred measures for tree growth. Misclassification is used extensively in tree pruning.
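The node-level version is a one-liner; as before, the class proportions are assumed to be supplied:

```python
def misclassification_error(probs):
    """MER at a node: 1 minus the maximum class proportion."""
    return 1 - max(probs)

print(misclassification_error([0.4, 0.3, 0.3]))  # ~0.6: the majority-class prediction is wrong 60% of the time
```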

Binary output, numeric attributes, Gini index
- Worked example from the published literature (the calculations here are incomplete).
- Shows how to obtain the Gini gain at a potential knot position; see RegTree.xlsx.
- Data from a study published in the European Journal of Cancer.
- The current gold-standard diagnostic predictor is age. Can AMH improve prediction? If so, by how much?

Study Cohort [figure]
RA Anderson, M Rosendahl, TW Kelsey and DA Cameron. Pretreatment anti-Müllerian hormone predicts for loss of ovarian function after chemotherapy for early breast cancer. European Journal of Cancer 49(16): 3404-3411, 2013.

Our Data [figure]
(Data from Anderson et al. 2013, cited above.)

Binary output, numeric attributes, Gini index: Recipe
1. Use a pivot table to get the output (i.e. A & M) proportions; the base Gini index is $1 - (p(A)^2 + p(M)^2)$.
2. Choose a split position for the covariate AMH.
3. Work out the numbers of A & M above and below the split.
4. Create a contingency matrix.
5. The Gini gain is: original Gini - p(above)*Gini(above) - p(below)*Gini(below).
6. Repeat for all other candidate split positions.
7. Repeat for Age instead of AMH.
8. Select the split position with the greatest Gini gain.
9. Start all over again at the split nodes.
A sketch of steps 1-5 for a single candidate split is given below.
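The study's AMH measurements are not reproduced in these slides, so the sketch below runs steps 1-5 of the recipe on a small hypothetical dataset; the outcome labels A and M and the knot position 2.0 are made up purely for illustration.

```python
# Gini gain at one candidate split position, following the recipe above.
# Hypothetical data: AMH values with binary outcomes A / M.
amh = [0.5, 1.1, 1.8, 2.4, 3.0, 4.2, 5.5, 6.1]
outcome = ["M", "M", "M", "A", "M", "A", "A", "A"]

def gini_from_labels(labels):
    """1 minus the sum of the squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

split = 2.0  # candidate knot position
below = [o for x, o in zip(amh, outcome) if x < split]
above = [o for x, o in zip(amh, outcome) if x >= split]

gain = (gini_from_labels(outcome)                                  # base Gini
        - len(above) / len(outcome) * gini_from_labels(above)      # p(above) * Gini(above)
        - len(below) / len(outcome) * gini_from_labels(below))     # p(below) * Gini(below)
print(round(gain, 3))  # 0.3; repeat over all candidate splits (and over Age) and pick the largest
```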

Contingency Matrix [figure]

Grown Tree [figure]

Partition of the Euclidean Plane [figure]

Initial Validation Analysis [figure]

Pruned Tree [figure]

Pruned Partition of the Euclidean Plane [figure]
(Anderson et al. 2013, cited above.)

Next Lecture
- Carl Donovan is standing in for me.
- More on tree building.
- Internal validation: pruning the trees.