# Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Save this PDF as:

Size: px
Start display at page:

Download "Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees"

## Transcription

1 Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, Toruń, Poland Abstract. Shannon entropy used in standard top-down decision trees does not guarantee the best generalization. Split criteria based on generalized entropies offer different compromise between purity of nodes and overall information gain. Modified C4.5 decision trees based on Tsallis and Renyi entropies have been tested on several high-dimensional microarray datasets with interesting results. This approach may be used in any decision tree and information selection algorithm. Key words: Decision rules, entropy, information theory, information selection, decision trees 1 Introduction Decision tree algorithms are still the foundation of most large data mining packages, offering easy and computationally efficient way to extract simple decision rules [1]. They should always be used as a reference, with more complex classification models justified only if they give significant improvement. Trees are based on recursive partitioning of data and unlike most learning systems they use different sets of features in different parts of the feature space, automatically performing local feature selection. This is an important and unique property of general divide-and-conquer algorithms that has not been paid much attention. The hidden nodes in the first neural layer weight the inputs in a different way, calculating specific projections of the input data on a line defined by the weights g k = W (k) X. This is still non-local feature that captures information from a whole sector of the input space, not from the localized region. On the other hand localized radial basis functions capture only the local information around some reference points. Recursive partitioning in decision trees is capable of capturing local information in some dimensions and non-local in others. This is a desirable property that may be used in neural algorithms based on localized projected data [2]. The C4.5 algorithm to generate trees [3] is still the basis of the most popular approach in this field. Tests for partitioning data in C4.5 decision trees are based on the concept of information entropy and applied to each feature x 1,x 2,... individually. Such tests create two nodes that should on the one hand contain

2 2 T. Maszczyk and W. Duch data that are as pure as possible (i.e. belong to a single class), and on the other hand increase overall separability of data. Tests that are based directly on indices measuring accuracy are optimal from Bayesian point of view, but are not so accurate as those based on information theory that may be evaluated with greater precision [4]. Choice of the test is always a hidden compromise in how much weight is put on the purity of samples in one or both nodes and the total gain achieved by partitioning of data. It is therefore worthwhile to test other types of entropies that may be used as tests. Essentially the same reasoning may be used in applications of entropy-based indices in feature selection. For some data features that can help to distinguish rare cases are important, but standard approaches may rank them quite low. In the next section properties of Shannon, Renyi and Tsallis entropies are described. As an example of application three microarray datasets are analyzed in the third section. This type of application is especially interesting for decision trees because of the high dimensionality of microarray data, the need to identify important genes and find simple decision rules. More sophisticated learning systems do not seem to achieve significantly higher accuracy. Conclusions are given in section four. 2 Theoretical framework Entropy is the measure of disorder in physical systems, or an amount of information that may be gained by observations of disordered systems. Claude Shannon defined a formal measure of entropy, called Shannon entropy[5]: S = n p i log 2 p i (1) i=1 where p i is the probability of occurrence of an event (feature value) x i being an element of the event (feature) X that can take values {x 1...x n }. The Shannon entropy is a decreasing function of a scattering of random variable, and is maximal when all the outcomes are equally likely. Shannon entropy is may be used globally, for the whole data, or locally, to evaluate entropy of probability density distributions around some points. This notion of entropy can be generalized to provide additional information about the importance of specific events, for example outliers or rare events. Comparing entropy of two distributions, corresponding for example to two features, Shannon entropy assumes implicite certain tradeoff between contributions from the tails and the main mass of this distribution. It should be worthwhile to control this tradeoff explicitly, as in many cases it may be important to distinguish weak signal overlapping with much stronger one. Entropy measures that depend on powers of probability, n i=1 p(x i) α, provide such control. If α has large positive value this measure is more sensitive to events that occur often, while for large negative α it is more sensitive to the events which happen seldom.

3 Entropies in Decision Trees 3 Constantino Tsallis [6] and Alfred Renyi [7] both proposed generalized entropies that for α = 1 reduce to the Shannon entropy. The Renyi entropy is defined as [7]: ( I α = 1 n ) 1 α log p α i (2) i=1 It has similar properties as the Shannon entropy: it is additive it has maximum = ln(n) forp i =1/n but it contains additional parameter α which can be used to make it more or less sensitive to the shape of probability distributions. Tsallis defined his entropy as: S α = 1 α 1 ( 1 Figures 1-3 shows illustration and comparison of Renyi, Tsallis and Shannon entropies for two probabilities p 1 and p 2 where p 1 =1 p 2. n i=1 p α i ) (3) alpha= 8 alpha= 3 alpha= 2 alpha= alpha=2 alpha=3 alpha=5 alpha= Fig. 1. Plots of the Renyi entropy for several negative and positive values of α. The modification of the standard C4.5 algorithm has been done by simply replacing the Shannon measure with one of the two other entropies, as the goal here was to evaluate their influence on the properties of decision trees. This means that the final split criterion is based on the gain ratio: a test on attribute A that partitions the data D in two branches with D t and D f data, with a set of classes omega has gain value: G(ω, A D) =H(ω D) D t D H(ω D t) D f D H(ω D f ) (4) where D is the number of elements in the D set and H(ω S) isoneofthe3 entropies considered here: Shannon, Renyi or Tsallis. Parameter α has clearly

4 4 T. Maszczyk and W. Duch alpha= 0.6 alpha= 0.4 alpha= 0.2 alpha= alpha=2 alpha=3 alpha=5 alpha= Fig. 2. Plots of the Tsallis entropy for several negative and positive values of α Fig. 3. Plot of the Shannon entropy

5 Entropies in Decision Trees 5 an influence on what type of splits are going to be created, with preference for negative values of α given to rare events or longer tails of probability distribution. 3 Empirical study To evaluate the usefulness of Renyi and Tsallis entropy measures in decision trees the classical C4.5 algorithm has been modified and applied first to artificial data to verify its usefulness. The results were encouraging, therefore experiments on three data sets of gene expression profiles were carried out. Such data are characterized by large number of features and very small number of samples. In such situations several features may by pure statistical chance seem to be quite informative and allow for good generalization. Therefore it is deceivingly simple to reach high accuracy on such data, although it is very difficult to classify these data reliably. Only the simplest models may avoid overfitting, therefore decision trees providing simple rules may have an advantage over other models. A summary of these data sets is presented in Table 1, all data were downloaded from the Kent Ridge Bio-Medical Data Set Repository sg/gedatasets/datasets.html. Short description of these datasets follows: 1. Leukemia: training dataset consists of 38 bone marrow samples (27 ALL and 11 AML), over 7129 probes from 6817 human genes. Also 34 samples testing data is provided, with 20 ALL and 14 AML cases. 2. Colon Tumor: data contains 62 samples collected from the colon cancer patients. Among them, 40 tumor biopsies are from tumors (labeled as negative ) and 22 normal (labeled as positive ) biopsies are from healthy parts of the colons of the same patients. Two thousand out of around 6500 genes were pre-selected by the authors of the experiment based on the confidence in the measured expression levels. 3. Diffuse Large B-cell Lymphoma (DLBCL) is the most common subtype of non-hodgkins lymphoma. The data contains gene expression data from distinct types of such cells. There are 47 samples, 24 of them are from germinal centre B-like group while 23 are activated B-like group. Each sample is described by 4026 genes. Title #Genes #Samples #Samples per class Source Colon cancer tumor 22 normal Alon at all (1999) [8] DLBCL GCB 23 AB Alizadeh at all (2000) [9] Leukemia ALL 25 AML Golub at all (1999) [10] Table 1. Summary of data sets

6 6 T. Maszczyk and W. Duch 4 Experiment and results For each data set the standard C4.5 decision tree is used 10 times, each time averaging the 10-fold crossvalidation mode, followed on the same partitioning of data by the modified algorithms based on Tsallis and Renyi entropies for different values of parameter α. Results are collected in Tables 2-7, with accuracies and standard deviations for each dataset. The best values of α parameter differ in each case and can be easily determined from crossvalidation. Although overall results may not improve accuracy for different classes they may strongly differ in sensitivity and specificity, as can be seen in the tables below. Entropy Renyi 64.6± ± ± ± ± ± ± ±2.6 Tsallis 64.6± ± ± ± ± ± ± ± Renyi 78.8± ± ± ± ± ± ± ±2.2 Tsallis 73.0± ± ± ± ± ± ± ±4.4 Shannon 81.2±3.7 Table 2. Accuracy on Colon cancer data set; Shannon and α = 1 results are identical. Entropy Class Renyi 1 0.0± ± ± ± ± ± ± ± ± ± ± ± ± ±3.9 Tsallis 1 0.0± ± ± ± ± ± ± ± ± ± ± ± ± ±3.9 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±3.6 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±3.4 Shannon ± ±4.8 Table 3. Accuracy per class on Colon cancer data set; Shannon and α = 1 results are identical. Results depend quite clearly on the α coefficient, with α = 1 always reproducing the Shannon entropy results. The optimal coefficient should be deter-

7 Entropies in Decision Trees 7 Entropy Renyi 46.0± ± ± ± ± ± ± ±4.9 Tsallis 52.4± ± ± ± ± ± ± ± Renyi 76.5± ± ± ± ± ± ± ±6.3 Tsallis 81.3± ± ± ± ± ± ± ±4.0 Shannon 78.5±4.8 Table 4. Accuracy on DLBCL data set Entropy Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±5.9 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±9.8 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±4.7 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±4.9 Shannon ± ±8.7 Table 5. Accuracy per class on DLBCL data set Entropy Renyi 65.4 ± ± ± ± ± ± ±4.6 Tsallis 65.4 ± ± ± ± ± ± ± Renyi 80.5± ± ± ± ± ± ±2.0 Tsallis 82.5± ± ± ± ± ± ±3.6 Shannon 81.4±4.1 Table 6. Accuracy on Leukemia data set

8 8 T. Maszczyk and W. Duch Entropy Class Renyi 1 100± ± ± ± ± ± ± ± ± ± ± ± ± ±10.6 Tsallis 1 100± ± ± ± ± ± ± ± ± ± ± ± ± ±10.3 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±5.9 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±6.5 Shannon ± ±5.7 Table 7. Accuracy per class on Leukemia data set mined through crossvalidation. For the Colon dataset (Tab. 2-3) peak accuracy is achieved for Renyi entropy with α = 2, with specificity (accuracy of the second class) significantly higher than for the Shannon case, and with smaller variance. Tsallis entropy seems to be very sensitive and does not improve over Shannon value around α = 1. For DLBCL both Renyi and Tsallis entropies with α in the range give the best results, improving both specificity and sensitivity of the Shannon measure (Tab. 4-5). For the Leukemia data best Renyi result for α = 0.1, around 88.5 ± 2.4 is significantly better than Shannon s 81.4 ± 4.1%, improving both the sensitivity and specificity and decreasing variance; results with α = 3 are also very good. Tsallis results for quite large α =5areeven better. Below one of the decision trees generated on the Leukemia data set with Renyi s entropy and α = 3 is presented: g760 > 588 : AML g760 <= 588 : c1926 <= 134 : ALL c1926 > 134 : AML Rules extracted from this tree are: Rule 1: g760 > > class AML [93.0%] Rule 2: g1926 > > class AML [89.1%]

9 Entropies in Decision Trees 9 Rule 3: g760 <= 588 g1926 <= > class ALL [93.9%] The trees are very small and should provide good generalization for small datasets, avoiding overfitting that other methods may suffer from. However, for gene expression data this may not be sufficient as the results are not stable against small perturbation of data (all learning methods suffer from this problem, see [?]). Therefore in practical applications a better approach is to define approximate coverings of redundant features and replace such groups of features with their linear combinations to reduce their dimensionality, aggregating information about similar genes (Biesiada and Duch, in preparation). For the demonstration of efficiency of non-standard entropy measures this is not so important. 5 Conclusions Information Theoretic Learning (ITL) has been used to train neural networks for blind source separation [11], definition of new error functions in neural networks [12], classification with labeled and unlabeled data, feature extraction and other applications [13]. The quadratic Renyi s error entropy has been used to minimize the average information content of the error signal for supervised adaptive system training. However, the use of non-shannon entropies in extraction of logical rules using decision trees has not yet been attempted. The importance of decision trees in large-scale data mining knowledge extraction applications and the simplicity of this approach encouraged us to explore this possibility. First experiments with modified split criterionof the C4.5 decision trees were presented here. Theoretical results and tests on artificial data show that this can be useful particularly for datasets with one or more small classes. Additional parameter α that may be easily adapted using crossvalidation tests gives possibility to tune the tree to the discrimination of classes of different size. This algorithm makes it more attractive than standard approach based on Shannon entropy that does not allow for exploration of the tradeoff between the probability of different classes and the overall information gain. This opens a way to applications in many types of decision trees, encouraging also modification of split criteria that are not based on entropy to account for the same tradeoff. An explicit formula for such tradeoff may be defined, aimed at separation of nodes of high purity at the expense of lower overall information gain. This as well as applications of non-standard entropies to the text classification data, where small classes are very important, remain to be explored. Bearing in mind the results and the simiplicity of the approach presented here the use of non-standard entropies may be highly recommended.

10 10 T. Maszczyk and W. Duch References 1. Duch, W., Setiono, R., Zurada, J.: Computational intelligence methods for understanding of data. Proceedings of the IEEE 92(5) (2004) Grochowski, M., Duch, W.: Learning highly non-separable boolean functions using constructive feedforward neural network. Lecture Notes in Computer Science 4668 (2007) Quinlan, J.: C 4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, CA (1993) 4. Duch, W.: Filter methods. In Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L., eds.: Feature extraction, foundations and applications. Physica Verlag, Springer, Berlin, Heidelberg, New York (2006) Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill. (1964) 6. Tsallis, C., Mendes, R., Plastino, A.: The role of constraints within generalized nonextensive statistics. Physica 261A (1998) Renyi, A.: Probability Theory. North-Holland, Amsterdam 8. Alon, U.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96 (1999) Alizadeh, A.: Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403 (2000) Golub, T.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (1999) Hild, K., Erdogmus, D., Principe, J.: Blind source separation using renyi s mutual information. IEEE Signal Processing Letters 8 (2001) Erdogmus, D., Principe, J.: Generalized information potential criterion for adaptive system training. IEEE Trans. on Neural Networks 13 (2002) Hild, K., Erdogmus, D., Torkkola, K., Principe, J.: Feature extraction using information-theoretic learning. IEEE Trans. on Pattern Analysis and Machine Intelligence 28 (2006)

### Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

### ECS 234: Data Analysis: Classification ECS 234

: Data Analysis: Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover meaningful classes Unsupervised

### FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES

FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES Alberto Bertoni, 1 Raffaella Folgieri, 1 Giorgio Valentini, 1 1 DSI, Dipartimento di Scienze

### Induction of Decision Trees

Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

### Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules.

Discretization of Continuous Attributes for Learning Classication Rules Aijun An and Nick Cercone Department of Computer Science, University of Waterloo Waterloo, Ontario N2L 3G1 Canada Abstract. We present

### Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

### Generative MaxEnt Learning for Multiclass Classification

Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

### A Posteriori Corrections to Classification Methods.

A Posteriori Corrections to Classification Methods. Włodzisław Duch and Łukasz Itert Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland; http://www.phys.uni.torun.pl/kmk

### 10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

### Classification Using Decision Trees

Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

### Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

### An Empirical Comparison of Dimensionality Reduction Methods for Classifying Gene and Protein Expression Datasets

An Empirical Comparison of Dimensionality Reduction Methods for Classifying Gene and Protein Expression Datasets George Lee 1, Carlos Rodriguez 2, and Anant Madabhushi 1 1 Rutgers, The State University

### CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#\$

### CS 6375 Machine Learning

CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

### Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

### Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

### CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

### Active Sonar Target Classification Using Classifier Ensembles

International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 11, Number 12 (2018), pp. 2125-2133 International Research Publication House http://www.irphouse.com Active Sonar Target

### 43400 Serdang Selangor, Malaysia Serdang Selangor, Malaysia 4

An Extended ID3 Decision Tree Algorithm for Spatial Data Imas Sukaesih Sitanggang # 1, Razali Yaakob #2, Norwati Mustapha #3, Ahmad Ainuddin B Nuruddin *4 # Faculty of Computer Science and Information

### Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

### Decision Trees Entropy, Information Gain, Gain Ratio

Changelog: 14 Oct, 30 Oct Decision Trees Entropy, Information Gain, Gain Ratio Lecture 3: Part 2 Outline Entropy Information gain Gain ratio Marina Santini Acknowledgements Slides borrowed and adapted

### Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

### On Improving the k-means Algorithm to Classify Unclassified Patterns

On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,

### MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

### Statistics Applied to Bioinformatics. Tests of homogeneity

Statistics Applied to Bioinformatics Tests of homogeneity Two-tailed test of homogeneity Two-tailed test H 0 :m = m Principle of the test Estimate the difference between m and m Compare this estimation

### the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

### Decision trees COMS 4771

Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

### Decision trees COMS 4771

Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

### CHAPTER-17. Decision Tree Induction

CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

### Chapter 6: Classification

Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant

### A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

### Data classification (II)

Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

### Introduction to Machine Learning Midterm Exam

10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

### Some New Results on Information Properties of Mixture Distributions

Filomat 31:13 (2017), 4225 4230 https://doi.org/10.2298/fil1713225t Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat Some New Results

### Decision Tree Learning Lecture 2

Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

### hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

### A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

### Classification and regression trees

Classification and regression trees Pierre Geurts p.geurts@ulg.ac.be Last update: 23/09/2015 1 Outline Supervised learning Decision tree representation Decision tree learning Extensions Regression trees

### Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

### Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

### A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

### Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

### Lecture 3: Decision Trees

Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change October 25, 2017 Ute Schmid (CogSys,

### Lecture 3: Decision Trees

Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

### SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it

### Rule Generation using Decision Trees

Rule Generation using Decision Trees Dr. Rajni Jain 1. Introduction A DT is a classification scheme which generates a tree and a set of rules, representing the model of different classes, from a given

### Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

### Molecular Classification of Biological Phenotypes

School of Informatics November 13, 2008 Outline Introduction Class Comparison Class Discovery Class Prediction Example Biological states and state modulation Chemical Genomics Toxicogenomics Software Tools

### Classification: Decision Trees

Classification: Decision Trees Outline Top-Down Decision Tree Construction Choosing the Splitting Attribute Information Gain and Gain Ratio 2 DECISION TREE An internal node is a test on an attribute. A

### Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

### Decision Trees (Cont.)

Decision Trees (Cont.) R&N Chapter 18.2,18.3 Side example with discrete (categorical) attributes: Predicting age (3 values: less than 30, 30-45, more than 45 yrs old) from census data. Attributes (split

### Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

### Learning Decision Trees

Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

### VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms

03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of

### Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

### ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

### Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

### Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

### Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

### Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom

### CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

### SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

### Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier

Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Kaizhu Huang, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

### Feature selection and extraction Spectral domain quality estimation Alternatives

Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic

### Performance Evaluation and Comparison

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

### Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

### Lecture 7 Decision Tree Classifier

Machine Learning Dr.Ammar Mohammed Lecture 7 Decision Tree Classifier Decision Tree A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification

### Using HDDT to avoid instances propagation in unbalanced and evolving data streams

Using HDDT to avoid instances propagation in unbalanced and evolving data streams IJCNN 2014 Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V Chawla and Gianluca Bontempi 07/07/2014

### A Tutorial on Support Vector Machine

A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances

### The Lady Tasting Tea. How to deal with multiple testing. Need to explore many models. More Predictive Modeling

The Lady Tasting Tea More Predictive Modeling R. A. Fisher & the Lady B. Muriel Bristol claimed she prefers tea added to milk rather than milk added to tea Fisher was skeptical that she could distinguish

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

### Feature Selection for SVMs

Feature Selection for SVMs J. Weston, S. Mukherjee, O. Chapelle, M. Pontil T. Poggio, V. Vapnik, Barnhill BioInformatics.com, Savannah, Georgia, USA. CBCL MIT, Cambridge, Massachusetts, USA. AT&T Research

### Decision Tree Learning

Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

### Growing a Large Tree

STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

### Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

### COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

### Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

### Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

### brainlinksystem.com \$25+ / hr AI Decision Tree Learning Part I Outline Learning 11/9/2010 Carnegie Mellon

I Decision Tree Learning Part I brainlinksystem.com \$25+ / hr Illah Nourbakhsh s version Chapter 8, Russell and Norvig Thanks to all past instructors Carnegie Mellon Outline Learning and philosophy Induction

### Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

### Non-Negative Factorization for Clustering of Microarray Data

INT J COMPUT COMMUN, ISSN 1841-9836 9(1):16-23, February, 2014. Non-Negative Factorization for Clustering of Microarray Data L. Morgos Lucian Morgos Dept. of Electronics and Telecommunications Faculty

### COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection

COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection Instructor: Herke van Hoof (herke.vanhoof@cs.mcgill.ca) Based on slides by:, Jackie Chi Kit Cheung Class web page:

### Understanding Generalization Error: Bounds and Decompositions

CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

### CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

### CS 380: ARTIFICIAL INTELLIGENCE

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING 11/11/2013 Santiago Ontañón santi@cs.drexel.edu https://www.cs.drexel.edu/~santi/teaching/2013/cs380/intro.html Summary so far: Rational Agents Problem

### International Journal "Information Theories & Applications" Vol.14 /

International Journal "Information Theories & Applications" Vol.4 / 2007 87 or 2) Nˆ t N. That criterion and parameters F, M, N assign method of constructing sample decision function. In order to estimate

### Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

### Decision Tree And Random Forest

Decision Tree And Random Forest Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni. Koblenz-Landau, Germany) Spring 2019 Contact: mailto: Ammar@cu.edu.eg

### ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order

MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive

### Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

### Gene Expression Data Classification With Kernel Principal Component Analysis

Journal of Biomedicine and Biotechnology 25:2 25 55 59 DOI:.55/JBB.25.55 RESEARCH ARTICLE Gene Expression Data Classification With Kernel Principal Component Analysis Zhenqiu Liu, Dechang Chen, 2 and Halima

### Parametric Techniques

Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

### Predictive Analytics on Accident Data Using Rule Based and Discriminative Classifiers

Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 3 (2017) pp. 461-469 Research India Publications http://www.ripublication.com Predictive Analytics on Accident Data Using

### 15-381: Artificial Intelligence. Decision trees

15-381: Artificial Intelligence Decision trees Bayes classifiers find the label that maximizes: Naïve Bayes models assume independence of the features given the label leading to the following over documents

### Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

### Supervised Learning. George Konidaris

Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom Mitchell,