
Using HDDT to avoid instances propagation in unbalanced and evolving data streams

IJCNN 2014

Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V. Chawla and Gianluca Bontempi

07/07/2014

INTRODUCTION

In many applications (e.g. fraud detection), we have a stream of observations that we want to classify in real time. When one class is rare, the classification problem is said to be unbalanced. In this paper we focus on unbalanced data streams with two classes: positive (minority) and negative (majority). In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees (e.g. C4.5 [13]). However, it is not always possible to either revisit or store old instances of a data stream.

INTRODUCTION II

We use the Hellinger Distance Decision Tree [7] (HDDT) to deal with skewed distributions in data streams. We consider data streams where the instances are received over time in streams of batches. HDDT allows us to remove instance propagation between batches, with several benefits:
- improved predictive accuracy
- speed
- a single pass through the data
An ensemble of HDDTs is used to combat concept drift and to increase the accuracy over single classifiers. We test our framework on several streaming datasets with unbalanced classes and concept drift.

HELLINGER DISTANCE DECISION TREES

HDDT [6] uses the Hellinger distance as the splitting criterion in decision trees for unbalanced problems. All numerical features are partitioned into p bins, so that only categorical variables are used. The Hellinger distance between the positive and the negative class is computed for each feature, and the feature with the maximum distance is used to split the tree:

d_H(f^+, f^-) = \sqrt{ \sum_{j=1}^{p} \left( \sqrt{ \frac{|f^+_j|}{|f^+|} } - \sqrt{ \frac{|f^-_j|}{|f^-|} } \right)^2 } \qquad (1)

where f^+ denotes the positive instances of feature f, f^- the negative ones, and f^+_j, f^-_j those falling in bin j. Note that the class priors do not appear explicitly in Equation 1.
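As an illustration, the following is a minimal Python sketch of Equation 1 for a single feature that has already been discretized into p bins; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def hellinger_split_value(binned_feature, labels, p):
    """Eq. 1: Hellinger distance between the positive- and negative-class
    distributions of one feature discretized into p bins (0..p-1)."""
    pos = binned_feature[labels == 1]  # f+: positive instances
    neg = binned_feature[labels == 0]  # f-: negative instances
    d = 0.0
    for j in range(p):
        pos_frac = np.mean(pos == j) if pos.size else 0.0  # |f+_j| / |f+|
        neg_frac = np.mean(neg == j) if neg.size else 0.0  # |f-_j| / |f-|
        d += (np.sqrt(pos_frac) - np.sqrt(neg_frac)) ** 2
    return np.sqrt(d)

# The tree then splits on the feature with the maximum distance, e.g.:
# best_f = max(range(X.shape[1]), key=lambda f: hellinger_split_value(X[:, f], y, p))
```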

CONCEPT DRIFT

The concept to learn can change due to non-stationary distributions. In the presence of concept drift, we can no longer assume that the training and testing batches come from the same distribution. Concept drift [11] can occur due to a change in:
- P(c), the class priors
- P(x|c), the class-conditional distributions
- P(c|x), the posterior distributions of class membership

CONCEPT DRIFT II

In general we don't know where the changes in distributions come from, but we can measure the distance between two batches to estimate the similarity of their distributions. If the batch used for training a model is not similar to the testing batch, we can assume that the model is obsolete. In our work we use batch distances to assign low weights to obsolete models.

HELLINGER DISTANCE BETWEEN BATCHES

Let B_t be the batch at time t used for training and B_{t+1} the subsequent testing batch. We compute the distance between B_t and B_{t+1} for a given feature f using the Hellinger distance, as proposed by Lichtenwalter and Chawla [12]:

HD(B_t, B_{t+1}, f) = \sqrt{ \sum_{v \in f} \left( \sqrt{ \frac{|B_{t,f=v}|}{|B_t|} } - \sqrt{ \frac{|B_{t+1,f=v}|}{|B_{t+1}|} } \right)^2 } \qquad (2)

where |B_{t,f=v}| is the number of instances of feature f taking value v in the batch at time t, while |B_t| is the total number of instances in the same batch.
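A small Python sketch of Equation 2, under the same assumption that the feature is categorical (or already binned); names are again illustrative.

```python
import numpy as np

def batch_hellinger(col_t, col_t1):
    """Eq. 2: Hellinger distance between two batches for one feature,
    given as the feature column of each batch."""
    d = 0.0
    for v in np.union1d(col_t, col_t1):
        frac_t = np.mean(col_t == v)     # |B_t,f=v|   / |B_t|
        frac_t1 = np.mean(col_t1 == v)   # |B_t+1,f=v| / |B_t+1|
        d += (np.sqrt(frac_t) - np.sqrt(frac_t1)) ** 2
    return np.sqrt(d)
```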

INFORMATION GAIN

Equation 2 does not account for differences in feature relevance. Lichtenwalter and Chawla [12] suggest using the information gain to weight the distances. For a given feature f of a batch B, the information gain (IG) is defined as the decrease in the entropy E of a class c:

IG(B, f) = E(B_c) - E(B_c \mid B_f) \qquad (3)

where B_c defines the class of the observations in batch B and B_f the observations of feature f.
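For concreteness, a short sketch of Equation 3 for one categorical feature, using the standard Shannon entropy; the helper names are ours.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy E of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(feature_col, labels):
    """Eq. 3: IG(B, f) = E(B_c) - E(B_c | B_f)."""
    conditional = 0.0
    for v in np.unique(feature_col):
        mask = feature_col == v
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional
```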

HELLINGER DISTANCE AND INFORMATION GAIN

We can make the assumption that feature relevance remains stable over time and define a new distance function that combines IG and HD:

HDIG(B_t, B_{t+1}, f) = HD(B_t, B_{t+1}, f) \cdot (1 + IG(B_t, f)) \qquad (4)

HDIG(B_t, B_{t+1}, f) provides a relevance-weighted distance for each single feature. The final distance between two batches is then computed by taking the average over all the features:

D_{HDIG}(B_t, B_{t+1}) = \frac{ \sum_{f \in B_t} HDIG(B_t, B_{t+1}, f) }{ |f \in B_t| } \qquad (5)
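Putting the two previous sketches together, Equations 4 and 5 could be implemented roughly as follows (reusing batch_hellinger and information_gain from the sketches above; this reflects our reading of the formulas, not the authors' code).

```python
def hdig(col_t, col_t1, labels_t):
    """Eq. 4: batch distance for one feature, weighted by its relevance."""
    return batch_hellinger(col_t, col_t1) * (1.0 + information_gain(col_t, labels_t))

def d_hdig(batch_t, batch_t1, labels_t):
    """Eq. 5: final distance = average HDIG over all feature columns."""
    n_features = batch_t.shape[1]
    return sum(hdig(batch_t[:, f], batch_t1[:, f], labels_t)
               for f in range(n_features)) / n_features
```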

MODEL WEIGHTS

In evolving data streams it is important to retain models learnt in the past and to learn new models as soon as new data becomes available. We learn a new model as soon as a new batch is available and store all the models. The learnt models are then combined into an ensemble where the weights are inversely proportional to the batch distances: the lower the distance between two batches, the more similar the concept between them. Lichtenwalter and Chawla [12] suggest the following transformation:

weight_t = \frac{1}{D_{HDIG}(B_t, B_{t+1})^b} \qquad (6)

where b represents the ensemble size.
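A sketch of Equation 6; the placement of the exponent b follows our reading of the formula. The weight shrinks as the distance between the model's training batch and the current testing batch grows.

```python
def model_weight(d_hdig_value, b):
    """Eq. 6: weight of a stored model, inversely proportional to the
    D_HDIG distance between its training batch and the testing batch;
    b is the ensemble size."""
    return 1.0 / d_hdig_value ** b

# The ensemble score is then a weighted combination of the stored models,
# e.g. (hypothetical scikit-learn-style models):
# score = sum(w * m.predict_proba(X_test)[:, 1] for w, m in zip(weights, models))
```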

EXPERIMENTAL SETUP

We compare HDDT with the C4.5 [13] decision tree, which is the standard for unbalanced data streams [10, 9, 12, 8]. We considered instance propagation methods that assume no sub-concepts within the minority class:
- SE (Gao's [8] propagation of rare-class instances and undersampling at 40%)
- BD (Lichtenwalter's Boundary Definition [12]: propagating rare-class instances and the instances in the negative class that the current model misclassifies)
- UNDER (undersampling: no propagation between batches, undersampling at 40%)
- BL (baseline: no propagation, no sampling)
For each of the previous sampling strategies we tested:
- HDIG: D_HDIG-weighted ensemble
- No ensemble: single classifier

DATASETS

We used different types of datasets:
- UCI datasets [1], to first study the unbalanced problem without worrying about concept drift.
- Artificial datasets from the MOA [2] framework with drifting features, to test the behavior of the algorithms under concept drift.
- A credit card dataset (online payment transactions between the 5th and the 25th of September 2013).

Name | Source | Instances | Features | Imbalance Ratio
Adult | UCI | 48,842 | 14 | 3.2:1
can | UCI | 443,872 | 9 | 52.1:1
compustat | UCI | 13,657 | 20 | 27.5:1
covtype | UCI | 38,500 | 10 | 13.0:1
football | UCI | 4,288 | 13 | 1.7:1
ozone-8h | UCI | 2,534 | 72 | 14.8:1
wrds | UCI | 99,200 | 41 | 1.0:1
text | UCI | 11,162 | 11,465 | 14.7:1
DriftedLED | MOA | 1,000,000 | 25 | 9.0:1
DriftedRBF | MOA | 1,000,000 | 11 | 1.02:1
DriftedWave | MOA | 1,000,000 | 41 | 2.02:1
Creditcard | FRAUD | 3,143,423 | 36 | 658.8:1

RESULTS UCI DATASETS

HDDT is better than C4.5, and ensembles have higher accuracy.

[Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all UCI datasets.]

[Figure: Batch average results in terms of computational TIME (lower is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all UCI datasets.]

RESULTS MOA DATASETS

HDIG-based ensembles return better results than single classifiers, and HDDT again gives better accuracy than C4.5.

[Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all drifting MOA datasets.]

RESULTS CREDIT CARD DATASET

Figure 5 shows the sum of the ranks for each strategy over all the chunks. For each chunk, we assign the highest rank to the most accurate strategy and then sum the ranks over all chunks.

[Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over the Credit card dataset.]

[Figure: Sum of ranks over all chunks for the Credit card dataset in terms of AUROC; in gray, the strategies that are not significantly worse than the best (the one with the highest sum of ranks). Strategies from highest to lowest sum of ranks: BL_HDIG_HDDT, UNDER_HDIG_HDDT, SE_HDIG_HDDT, UNDER_HDIG_C4.5, BD_HDIG_HDDT, BD_HDDT, BD_HDIG_C4.5, BL_HDDT, UNDER_HDDT, SE_HDIG_C4.5, SE_HDDT, BL_HDIG_C4.5, UNDER_C4.5, BD_C4.5, SE_C4.5, BL_C4.5.]

CONCLUSION

In unbalanced data streams, state-of-the-art techniques use instance propagation/sampling to deal with skewed batches. HDDT without sampling typically leads to better results than C4.5 with sampling. Removing the propagation/sampling step from the learning process has several benefits:
- It allows a single-pass approach (it avoids several passes through the batches for instance propagation).
- It reduces the computational cost (with massive amounts of data it may no longer be possible to store/retrieve old instances).
- It avoids the problem of finding previous minority instances from the same concept.

CONCLUSION II

The HDIG ensemble weighting strategy:
- performs better than single classifiers
- takes drifts in the data stream into account
On the credit card dataset, HDDT performs very well when combined with BL (no sampling) and UNDER (undersampling). These sampling strategies are the fastest, since no observations are stored from previous chunks. Future work will investigate propagation methods such as REA [4], SERA [3] and HUWRS.IP [9], which are able to deal with sub-concepts in the minority class.

THANK YOU FOR YOUR ATTENTION

BIBLIOGRAPHY I

[1] D. N. A. Asuncion. UCI machine learning repository, 2007.
[2] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. The Journal of Machine Learning Research, 99:1601–1604, 2010.
[3] S. Chen and H. He. SERA: selectively recursive approach towards nonstationary imbalanced stream data mining. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 522–529. IEEE, 2009.
[4] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach. Evolving Systems, 2(1):35–50, 2011.
[5] D. A. Cieslak and N. V. Chawla. Detecting fractures in classifier performance. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 123–132. IEEE, 2007.
[6] D. A. Cieslak and N. V. Chawla. Learning decision trees for unbalanced data. In Machine Learning and Knowledge Discovery in Databases, pages 241–256. Springer, 2008.

BIBLIOGRAPHY II

[7] D. A. Cieslak, T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer. Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 24(1):136–158, 2012.
[8] J. Gao, B. Ding, W. Fan, J. Han, and P. S. Yu. Classifying data streams with skewed class distributions and concept drifts. Internet Computing, 12(6):37–49, 2008.
[9] T. R. Hoens and N. V. Chawla. Learning in non-stationary environments with class imbalance. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–176. ACM, 2012.
[10] T. R. Hoens, N. V. Chawla, and R. Polikar. Heuristic updatable weighted random subspaces for non-stationary environments. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 241–250. IEEE, 2011.
[11] M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–371. ACM, 1999.
[12] R. N. Lichtenwalter and N. V. Chawla. Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In New Frontiers in Applied Data Mining, pages 53–75. Springer, 2010.

BIBLIOGRAPHY III

[13] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[14] C. R. Rao. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió: Quaderns d'Estadística, Sistemes, Informàtica i Investigació Operativa, 19(1):23–63, 1995.

HELLINGER DISTANCE

The Hellinger distance quantifies the similarity between two probability distributions [14]. It was recently proposed as a splitting criterion in decision trees for unbalanced problems [6, 7]. In data streams, it has been used to detect classifier performance degradation due to concept drift [5, 12].

HELLINGER DISTANCE II

Consider two probability measures P_1, P_2 of discrete distributions in \Phi. The Hellinger distance is defined as:

d_H(P_1, P_2) = \sqrt{ \sum_{\phi \in \Phi} \left( \sqrt{P_1(\phi)} - \sqrt{P_2(\phi)} \right)^2 } \qquad (7)

The Hellinger distance has several properties:
- d_H(P_1, P_2) = d_H(P_2, P_1) (symmetric)
- d_H(P_1, P_2) >= 0 (non-negative)
- d_H(P_1, P_2) \in [0, \sqrt{2}]
- d_H is close to zero when the distributions are similar and close to \sqrt{2} for distinct distributions.
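Equation 7 is easy to check numerically; a minimal sketch with two toy distributions illustrates the bounds.

```python
import numpy as np

def hellinger(p1, p2):
    """Eq. 7 for two discrete distributions given as aligned probability vectors."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.sqrt(np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2))

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.414...: sqrt(2), disjoint supports
```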

CONCEPT DRIFT II

Let us define X_t = \{x_0, x_1, \ldots, x_t\} as the set of labeled observations available at time t. For a new unlabelled instance x_{t+1} we can train a classifier on X_t and predict P(c_i | x_{t+1}). Using Bayes' theorem we can write P(c_i | x_{t+1}) as:

P(c_i \mid x_{t+1}) = \frac{P(c_i)\, P(x_{t+1} \mid c_i)}{P(x_{t+1})} \qquad (8)

Since P(x_{t+1}) is the same for all classes c_i, we can drop it:

P(c_i \mid x_{t+1}) \propto P(c_i)\, P(x_{t+1} \mid c_i) \qquad (9)

Kelly [11] argues that concept drift can occur from a change in any of the terms in Equation 9, namely:
- P(c_i), the class priors
- P(x_{t+1} | c_i), the class-conditional distributions
- P(c_i | x_{t+1}), the posterior distributions of class membership

CONCEPT DRIFT III

The change in the posterior distributions is the worst type of drift, because it directly affects the performance of the classifier [9]. In general we don't know where the changes in distributions come from, but we can measure the distance between two batches to estimate the similarity of their distributions.