Decision trees for stream data mining: new results

Decision trees for stream data mining: new results

Leszek Rutkowski (leszek.rutkowski@iisi.pcz.pl), Lena Pietruczuk (lena.pietruczuk@iisi.pcz.pl), Maciej Jaworski (maciej.jaworski@iisi.pcz.pl), Piotr Duda (piotr.duda@iisi.pcz.pl)
Czestochowa University of Technology, Institute of Computational Intelligence

1. Decision trees for stream data mining: new results
2. Hoeffding's inequality incorrectly used for stream data mining
3. New techniques to derive split measures
4. Comparison of our results with Hoeffding's bound
5. How much data is enough to make a split?

[1] Leszek Rutkowski, Lena Pietruczuk, Piotr Duda, Maciej Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272-1279, 2013.
[2] Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, "Decision trees for mining data streams based on the Gaussian approximation," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 108-119, 2014.
[3] Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, "The CART decision tree for mining data streams," Information Sciences, vol. 266, pp. 1-15, 2014.

The commonly known algorithm called the Hoeffding Tree was introduced by P. Domingos and G. Hulten in [4]. The main mathematical tool used in this algorithm is Hoeffding's inequality [5].

Theorem: If $X_1, X_2, \ldots, X_n$ are independent random variables and $a_i \le X_i \le b_i$ ($i = 1, 2, \ldots, n$), then for $\epsilon > 0$

$P\left(\bar{X} - E[\bar{X}] \ge \epsilon\right) \le e^{-2n^2\epsilon^2 / \sum_{i=1}^{n}(b_i - a_i)^2}$,

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $E[\bar{X}]$ is the expected value of $\bar{X}$.

[4] P. Domingos and G. Hulten, "Mining high-speed data streams," Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[5] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13-30, March 1963.

If $X_i$, $i = 1, \ldots, n$, are random variables with range R, then Hoeffding's bound takes the form

$P\left(\bar{X} - E[\bar{X}] \le \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}\right) \ge 1 - \delta$.
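
As a minimal numerical sketch (my own illustration, not taken from the slides), the tolerance used by the Hoeffding tree of [4] after n stream examples follows directly from this formula; the parameter values below are arbitrary.

```python
import math

def hoeffding_epsilon(r: float, n: int, delta: float) -> float:
    """Hoeffding tolerance for a statistic with range r after n observations."""
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))

# Illustrative values: information gain on a two-class problem has range R = 1.
print(hoeffding_epsilon(r=1.0, n=1000, delta=1e-7))  # ~0.0898
```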

Hoeffding's inequality is the wrong tool for choosing the best attribute on which to split a node when the split measure is the Gini index (e.g. the CART algorithm) or the information entropy (e.g. the ID3 algorithm). This observation follows from the following facts:
- The best-known split measures, based on information entropy or the Gini index, cannot be expressed as a sum of independent elements; they are computed only from the class frequencies of the elements.
- Hoeffding's inequality is applicable only to numerical data.
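
To make the first point concrete (a minimal sketch of my own, not from the slides): the empirical entropy at a node is a nonlinear function of the class-frequency vector, not a sample mean of independent bounded terms, which is the form Hoeffding's theorem requires.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Empirical entropy, computed from class frequencies only."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A nonlinear function of (p_1, ..., p_K); it cannot be written as
# (1/n) * sum_i f(x_i) for a fixed bounded f, as Hoeffding's inequality assumes.
print(entropy(["a", "a", "b", "c"]))  # 1.5
```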

Therefore the idea presented in [4] violates the assumptions of Hoeffding's theorem (see [5]), and the concept of Hoeffding Trees has no theoretical justification.

In our papers [1], [2] and [3] we challenge all the results presented in the world literature (2000-2014) that are based on Hoeffding's inequality. In particular, we challenge the following result, used there as the split tolerance:

$\epsilon_H = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}}$

In our papers [1], [2], [3]:
a) we study the following decision trees and split measures:
   - information gain (ID3)
   - Gini gain (CART)
b) we propose two different techniques to derive split measures:
   - McDiarmid's inequality
   - the Gaussian approximation
c) we replace the incorrectly obtained bound $\epsilon_H$ (see the previous slide) with the new bounds shown on the next slide; the generic split rule they plug into is sketched below
d) we determine the number of data elements sufficient to make a split
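
For context, a sketch (my own, under the usual Hoeffding-tree setting rather than the authors' code): such stream learners split a leaf once the advantage of the best attribute over the second best exceeds the tolerance; in [1]-[3] the same rule is kept, but the tolerance is computed from the McDiarmid or Gaussian-approximation bounds instead of $\epsilon_H$.

```python
def should_split(best_gain: float, second_best_gain: float, epsilon: float) -> bool:
    """Split once the observed advantage of the best attribute exceeds epsilon.
    epsilon comes from one of the bounds on the next slide (not from eps_H)."""
    return best_gain - second_best_gain > epsilon
```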

Comparison of the bounds:

Hoeffding's bound (incorrectly obtained):
- information gain: $\epsilon_H = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}}$
- Gini index gain: $\epsilon_H = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}}$

McDiarmid's bound:
- information gain: $\epsilon_{M1} = C_{Gain}(K, N)\sqrt{\frac{\ln(1/\delta)}{2N}}$, where $C_{Gain}(K, N) = 6\left(K \log_2(eN) + \log_2(2N)\right) + 2\log_2 K$, see [1]
- Gini index gain: $\epsilon_{M2} = 8\sqrt{\frac{\ln(1/\delta)}{2N}}$, see [1]

Gaussian approximation:
- information gain: $\epsilon_{G1} = z_{(1-\delta)}\sqrt{\frac{2Q(C)}{N}}$, where $Q(C) = \frac{\log_2^2(e^2 C)}{e^2} + \frac{\log_2^2(e^2)}{e^2} + \frac{\log_2^2 e}{e} + \frac{\log_2^2(2C)}{4}$, see [2]
- Gini index gain: $\epsilon_{G2} = z_{(1-\delta)}\sqrt{\frac{2Q(K)}{N}}$, where $Q(K) = 5K^2 - 8K + 4$, see [3]

where:
- N denotes the number of data elements,
- K denotes the number of classes,
- $z_{(1-\delta)}$ denotes the $(1-\delta)$-quantile of the standard normal distribution N(0, 1),
- $C = \frac{1-th}{th}$, where th is a threshold value.
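
The correctly derived bounds translate into a few lines of Python; the sketch below is mine (illustrative names, not the authors' code) and uses scipy.stats.norm.ppf for the quantile $z_{(1-\delta)}$. The $\epsilon_{G1}$ case is omitted because it additionally needs the constant Q(C).

```python
import math
from scipy.stats import norm  # norm.ppf: quantile of the standard normal N(0, 1)

def eps_mcdiarmid_gain(k: int, n: int, delta: float) -> float:
    """McDiarmid bound for information gain, see [1]."""
    c_gain = 6.0 * (k * math.log2(math.e * n) + math.log2(2.0 * n)) + 2.0 * math.log2(k)
    return c_gain * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def eps_mcdiarmid_gini(n: int, delta: float) -> float:
    """McDiarmid bound for the Gini index gain, see [1]."""
    return 8.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def eps_gaussian_gini(k: int, n: int, delta: float) -> float:
    """Gaussian-approximation bound for the Gini index gain, see [3]."""
    q_k = 5.0 * k * k - 8.0 * k + 4.0
    return norm.ppf(1.0 - delta) * math.sqrt(2.0 * q_k / n)
```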

I. For information gain, the value of $\epsilon_{G1}$ is much better than the value of $\epsilon_{M1}$; however, it can be applied only to two-class problems. For details see [2].
II. For the Gini index, the value of $\epsilon_{G2}$ gives better results than $\epsilon_{M2}$ for K = 2. For K > 2, the value of $\epsilon_{M2}$ is more efficient. For details see [3].
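
A quick numerical check of statement II (my arithmetic, with illustrative values $\delta = 0.05$, N = 10000): for K = 2, Q(K) = 8 and $\epsilon_{G2} \approx 1.645\sqrt{16/10000} \approx 0.066$, while $\epsilon_{M2} = 8\sqrt{\ln(20)/20000} \approx 0.098$, so the Gaussian bound is tighter; for K = 10, Q(K) = 424 and $\epsilon_{G2} \approx 0.48$, so the McDiarmid bound $\epsilon_{M2} \approx 0.098$ is the more efficient one.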

How much data is enough to make a split? The number of data elements N sufficient to make a split, for a given tolerance:

Hoeffding's bound (incorrectly obtained):
- information gain: $N > \frac{R^2 \ln(1/\delta)}{2(\epsilon_H)^2}$
- Gini index gain: $N > \frac{R^2 \ln(1/\delta)}{2(\epsilon_H)^2}$

McDiarmid's bound:
- information gain: can be determined numerically, see [1]
- Gini index gain: $N > \frac{32 \ln(1/\delta)}{(\epsilon_{M2})^2}$, see [1]

Gaussian approximation:
- information gain: $N > \frac{2Q(C)\,(z_{(1-\delta)})^2}{(\epsilon_{G1})^2}$, with Q(C) as defined earlier, see [2]
- Gini index gain: $N > \frac{2Q(K)\,(z_{(1-\delta)})^2}{(\epsilon_{G2})^2}$, where $Q(K) = 5K^2 - 8K + 4$, see [3]
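
A corresponding sketch for the sample sizes (again my own illustration, with assumed function names and parameter values): the Gini-index cases have closed forms, while the McDiarmid information-gain case must be solved numerically because $C_{Gain}(K, N)$ itself depends on N.

```python
import math
from scipy.stats import norm

def n_mcdiarmid_gini(eps: float, delta: float) -> int:
    """N > 32 ln(1/delta) / eps^2 (Gini index gain, see [1])."""
    return math.ceil(32.0 * math.log(1.0 / delta) / eps ** 2)

def n_gaussian_gini(eps: float, delta: float, k: int) -> int:
    """N > 2 Q(K) z_(1-delta)^2 / eps^2 (Gini index gain, see [3])."""
    q_k = 5.0 * k * k - 8.0 * k + 4.0
    z = norm.ppf(1.0 - delta)
    return math.ceil(2.0 * q_k * z * z / eps ** 2)

def n_mcdiarmid_gain(eps: float, delta: float, k: int) -> int:
    """Information gain with the McDiarmid bound: C_Gain depends on N,
    so simply double N until the bound drops below eps (numerical search)."""
    def eps_m1(n: int) -> float:
        c_gain = 6.0 * (k * math.log2(math.e * n) + math.log2(2.0 * n)) + 2.0 * math.log2(k)
        return c_gain * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    n = 2
    while eps_m1(n) >= eps:
        n *= 2
    return n  # an N satisfying the bound, not necessarily the smallest

print(n_mcdiarmid_gini(eps=0.05, delta=1e-7))      # ~2.1e5
print(n_gaussian_gini(eps=0.05, delta=1e-7, k=2))  # ~1.7e5
```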