# Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Save this PDF as:

Size: px
Start display at page:

Download "Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees"

## Transcription

1 Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, Toruń, Poland Abstract. Shannon entropy used in standard top-down decision trees does not guarantee the best generalization. Split criteria based on generalized entropies offer different compromise between purity of nodes and overall information gain. Modified C4.5 decision trees based on Tsallis and Renyi entropies have been tested on several high-dimensional microarray datasets with interesting results. This approach may be used in any decision tree and information selection algorithm. Key words: Decision rules, entropy, information theory, information selection, decision trees 1 Introduction Decision tree algorithms are still the foundation of most large data mining packages, offering easy and computationally efficient way to extract simple decision rules [1]. They should always be used as a reference, with more complex classification models justified only if they give significant improvement. Trees are based on recursive partitioning of data and unlike most learning systems they use different sets of features in different parts of the feature space, automatically performing local feature selection. This is an important and unique property of general divide-and-conquer algorithms that has not been paid much attention. The hidden nodes in the first neural layer weight the inputs in a different way, calculating specific projections of the input data on a line defined by the weights g k = W (k) X. This is still non-local feature that captures information from a whole sector of the input space, not from the localized region. On the other hand localized radial basis functions capture only the local information around some reference points. Recursive partitioning in decision trees is capable of capturing local information in some dimensions and non-local in others. This is a desirable property that may be used in neural algorithms based on localized projected data [2]. The C4.5 algorithm to generate trees [3] is still the basis of the most popular approach in this field. Tests for partitioning data in C4.5 decision trees are based on the concept of information entropy and applied to each feature x 1,x 2,... individually. Such tests create two nodes that should on the one hand contain

2 2 T. Maszczyk and W. Duch data that are as pure as possible (i.e. belong to a single class), and on the other hand increase overall separability of data. Tests that are based directly on indices measuring accuracy are optimal from Bayesian point of view, but are not so accurate as those based on information theory that may be evaluated with greater precision [4]. Choice of the test is always a hidden compromise in how much weight is put on the purity of samples in one or both nodes and the total gain achieved by partitioning of data. It is therefore worthwhile to test other types of entropies that may be used as tests. Essentially the same reasoning may be used in applications of entropy-based indices in feature selection. For some data features that can help to distinguish rare cases are important, but standard approaches may rank them quite low. In the next section properties of Shannon, Renyi and Tsallis entropies are described. As an example of application three microarray datasets are analyzed in the third section. This type of application is especially interesting for decision trees because of the high dimensionality of microarray data, the need to identify important genes and find simple decision rules. More sophisticated learning systems do not seem to achieve significantly higher accuracy. Conclusions are given in section four. 2 Theoretical framework Entropy is the measure of disorder in physical systems, or an amount of information that may be gained by observations of disordered systems. Claude Shannon defined a formal measure of entropy, called Shannon entropy[5]: S = n p i log 2 p i (1) i=1 where p i is the probability of occurrence of an event (feature value) x i being an element of the event (feature) X that can take values {x 1...x n }. The Shannon entropy is a decreasing function of a scattering of random variable, and is maximal when all the outcomes are equally likely. Shannon entropy is may be used globally, for the whole data, or locally, to evaluate entropy of probability density distributions around some points. This notion of entropy can be generalized to provide additional information about the importance of specific events, for example outliers or rare events. Comparing entropy of two distributions, corresponding for example to two features, Shannon entropy assumes implicite certain tradeoff between contributions from the tails and the main mass of this distribution. It should be worthwhile to control this tradeoff explicitly, as in many cases it may be important to distinguish weak signal overlapping with much stronger one. Entropy measures that depend on powers of probability, n i=1 p(x i) α, provide such control. If α has large positive value this measure is more sensitive to events that occur often, while for large negative α it is more sensitive to the events which happen seldom.

3 Entropies in Decision Trees 3 Constantino Tsallis [6] and Alfred Renyi [7] both proposed generalized entropies that for α = 1 reduce to the Shannon entropy. The Renyi entropy is defined as [7]: ( I α = 1 n ) 1 α log p α i (2) i=1 It has similar properties as the Shannon entropy: it is additive it has maximum = ln(n) forp i =1/n but it contains additional parameter α which can be used to make it more or less sensitive to the shape of probability distributions. Tsallis defined his entropy as: S α = 1 α 1 ( 1 Figures 1-3 shows illustration and comparison of Renyi, Tsallis and Shannon entropies for two probabilities p 1 and p 2 where p 1 =1 p 2. n i=1 p α i ) (3) alpha= 8 alpha= 3 alpha= 2 alpha= alpha=2 alpha=3 alpha=5 alpha= Fig. 1. Plots of the Renyi entropy for several negative and positive values of α. The modification of the standard C4.5 algorithm has been done by simply replacing the Shannon measure with one of the two other entropies, as the goal here was to evaluate their influence on the properties of decision trees. This means that the final split criterion is based on the gain ratio: a test on attribute A that partitions the data D in two branches with D t and D f data, with a set of classes omega has gain value: G(ω, A D) =H(ω D) D t D H(ω D t) D f D H(ω D f ) (4) where D is the number of elements in the D set and H(ω S) isoneofthe3 entropies considered here: Shannon, Renyi or Tsallis. Parameter α has clearly

4 4 T. Maszczyk and W. Duch alpha= 0.6 alpha= 0.4 alpha= 0.2 alpha= alpha=2 alpha=3 alpha=5 alpha= Fig. 2. Plots of the Tsallis entropy for several negative and positive values of α Fig. 3. Plot of the Shannon entropy

5 Entropies in Decision Trees 5 an influence on what type of splits are going to be created, with preference for negative values of α given to rare events or longer tails of probability distribution. 3 Empirical study To evaluate the usefulness of Renyi and Tsallis entropy measures in decision trees the classical C4.5 algorithm has been modified and applied first to artificial data to verify its usefulness. The results were encouraging, therefore experiments on three data sets of gene expression profiles were carried out. Such data are characterized by large number of features and very small number of samples. In such situations several features may by pure statistical chance seem to be quite informative and allow for good generalization. Therefore it is deceivingly simple to reach high accuracy on such data, although it is very difficult to classify these data reliably. Only the simplest models may avoid overfitting, therefore decision trees providing simple rules may have an advantage over other models. A summary of these data sets is presented in Table 1, all data were downloaded from the Kent Ridge Bio-Medical Data Set Repository sg/gedatasets/datasets.html. Short description of these datasets follows: 1. Leukemia: training dataset consists of 38 bone marrow samples (27 ALL and 11 AML), over 7129 probes from 6817 human genes. Also 34 samples testing data is provided, with 20 ALL and 14 AML cases. 2. Colon Tumor: data contains 62 samples collected from the colon cancer patients. Among them, 40 tumor biopsies are from tumors (labeled as negative ) and 22 normal (labeled as positive ) biopsies are from healthy parts of the colons of the same patients. Two thousand out of around 6500 genes were pre-selected by the authors of the experiment based on the confidence in the measured expression levels. 3. Diffuse Large B-cell Lymphoma (DLBCL) is the most common subtype of non-hodgkins lymphoma. The data contains gene expression data from distinct types of such cells. There are 47 samples, 24 of them are from germinal centre B-like group while 23 are activated B-like group. Each sample is described by 4026 genes. Title #Genes #Samples #Samples per class Source Colon cancer tumor 22 normal Alon at all (1999) [8] DLBCL GCB 23 AB Alizadeh at all (2000) [9] Leukemia ALL 25 AML Golub at all (1999) [10] Table 1. Summary of data sets

6 6 T. Maszczyk and W. Duch 4 Experiment and results For each data set the standard C4.5 decision tree is used 10 times, each time averaging the 10-fold crossvalidation mode, followed on the same partitioning of data by the modified algorithms based on Tsallis and Renyi entropies for different values of parameter α. Results are collected in Tables 2-7, with accuracies and standard deviations for each dataset. The best values of α parameter differ in each case and can be easily determined from crossvalidation. Although overall results may not improve accuracy for different classes they may strongly differ in sensitivity and specificity, as can be seen in the tables below. Entropy Renyi 64.6± ± ± ± ± ± ± ±2.6 Tsallis 64.6± ± ± ± ± ± ± ± Renyi 78.8± ± ± ± ± ± ± ±2.2 Tsallis 73.0± ± ± ± ± ± ± ±4.4 Shannon 81.2±3.7 Table 2. Accuracy on Colon cancer data set; Shannon and α = 1 results are identical. Entropy Class Renyi 1 0.0± ± ± ± ± ± ± ± ± ± ± ± ± ±3.9 Tsallis 1 0.0± ± ± ± ± ± ± ± ± ± ± ± ± ±3.9 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±3.6 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±3.4 Shannon ± ±4.8 Table 3. Accuracy per class on Colon cancer data set; Shannon and α = 1 results are identical. Results depend quite clearly on the α coefficient, with α = 1 always reproducing the Shannon entropy results. The optimal coefficient should be deter-

7 Entropies in Decision Trees 7 Entropy Renyi 46.0± ± ± ± ± ± ± ±4.9 Tsallis 52.4± ± ± ± ± ± ± ± Renyi 76.5± ± ± ± ± ± ± ±6.3 Tsallis 81.3± ± ± ± ± ± ± ±4.0 Shannon 78.5±4.8 Table 4. Accuracy on DLBCL data set Entropy Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±5.9 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±9.8 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±4.7 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±4.9 Shannon ± ±8.7 Table 5. Accuracy per class on DLBCL data set Entropy Renyi 65.4 ± ± ± ± ± ± ±4.6 Tsallis 65.4 ± ± ± ± ± ± ± Renyi 80.5± ± ± ± ± ± ±2.0 Tsallis 82.5± ± ± ± ± ± ±3.6 Shannon 81.4±4.1 Table 6. Accuracy on Leukemia data set

8 8 T. Maszczyk and W. Duch Entropy Class Renyi 1 100± ± ± ± ± ± ± ± ± ± ± ± ± ±10.6 Tsallis 1 100± ± ± ± ± ± ± ± ± ± ± ± ± ±10.3 Class Renyi ± ± ± ± ± ± ± ± ± ± ± ± ± ±5.9 Tsallis ± ± ± ± ± ± ± ± ± ± ± ± ± ±6.5 Shannon ± ±5.7 Table 7. Accuracy per class on Leukemia data set mined through crossvalidation. For the Colon dataset (Tab. 2-3) peak accuracy is achieved for Renyi entropy with α = 2, with specificity (accuracy of the second class) significantly higher than for the Shannon case, and with smaller variance. Tsallis entropy seems to be very sensitive and does not improve over Shannon value around α = 1. For DLBCL both Renyi and Tsallis entropies with α in the range give the best results, improving both specificity and sensitivity of the Shannon measure (Tab. 4-5). For the Leukemia data best Renyi result for α = 0.1, around 88.5 ± 2.4 is significantly better than Shannon s 81.4 ± 4.1%, improving both the sensitivity and specificity and decreasing variance; results with α = 3 are also very good. Tsallis results for quite large α =5areeven better. Below one of the decision trees generated on the Leukemia data set with Renyi s entropy and α = 3 is presented: g760 > 588 : AML g760 <= 588 : c1926 <= 134 : ALL c1926 > 134 : AML Rules extracted from this tree are: Rule 1: g760 > > class AML [93.0%] Rule 2: g1926 > > class AML [89.1%]

9 Entropies in Decision Trees 9 Rule 3: g760 <= 588 g1926 <= > class ALL [93.9%] The trees are very small and should provide good generalization for small datasets, avoiding overfitting that other methods may suffer from. However, for gene expression data this may not be sufficient as the results are not stable against small perturbation of data (all learning methods suffer from this problem, see [?]). Therefore in practical applications a better approach is to define approximate coverings of redundant features and replace such groups of features with their linear combinations to reduce their dimensionality, aggregating information about similar genes (Biesiada and Duch, in preparation). For the demonstration of efficiency of non-standard entropy measures this is not so important. 5 Conclusions Information Theoretic Learning (ITL) has been used to train neural networks for blind source separation [11], definition of new error functions in neural networks [12], classification with labeled and unlabeled data, feature extraction and other applications [13]. The quadratic Renyi s error entropy has been used to minimize the average information content of the error signal for supervised adaptive system training. However, the use of non-shannon entropies in extraction of logical rules using decision trees has not yet been attempted. The importance of decision trees in large-scale data mining knowledge extraction applications and the simplicity of this approach encouraged us to explore this possibility. First experiments with modified split criterionof the C4.5 decision trees were presented here. Theoretical results and tests on artificial data show that this can be useful particularly for datasets with one or more small classes. Additional parameter α that may be easily adapted using crossvalidation tests gives possibility to tune the tree to the discrimination of classes of different size. This algorithm makes it more attractive than standard approach based on Shannon entropy that does not allow for exploration of the tradeoff between the probability of different classes and the overall information gain. This opens a way to applications in many types of decision trees, encouraging also modification of split criteria that are not based on entropy to account for the same tradeoff. An explicit formula for such tradeoff may be defined, aimed at separation of nodes of high purity at the expense of lower overall information gain. This as well as applications of non-standard entropies to the text classification data, where small classes are very important, remain to be explored. Bearing in mind the results and the simiplicity of the approach presented here the use of non-standard entropies may be highly recommended.

10 10 T. Maszczyk and W. Duch References 1. Duch, W., Setiono, R., Zurada, J.: Computational intelligence methods for understanding of data. Proceedings of the IEEE 92(5) (2004) Grochowski, M., Duch, W.: Learning highly non-separable boolean functions using constructive feedforward neural network. Lecture Notes in Computer Science 4668 (2007) Quinlan, J.: C 4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, CA (1993) 4. Duch, W.: Filter methods. In Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L., eds.: Feature extraction, foundations and applications. Physica Verlag, Springer, Berlin, Heidelberg, New York (2006) Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill. (1964) 6. Tsallis, C., Mendes, R., Plastino, A.: The role of constraints within generalized nonextensive statistics. Physica 261A (1998) Renyi, A.: Probability Theory. North-Holland, Amsterdam 8. Alon, U.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96 (1999) Alizadeh, A.: Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403 (2000) Golub, T.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (1999) Hild, K., Erdogmus, D., Principe, J.: Blind source separation using renyi s mutual information. IEEE Signal Processing Letters 8 (2001) Erdogmus, D., Principe, J.: Generalized information potential criterion for adaptive system training. IEEE Trans. on Neural Networks 13 (2002) Hild, K., Erdogmus, D., Torkkola, K., Principe, J.: Feature extraction using information-theoretic learning. IEEE Trans. on Pattern Analysis and Machine Intelligence 28 (2006)

### 10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

### Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

### Molecular Classification of Biological Phenotypes

School of Informatics November 13, 2008 Outline Introduction Class Comparison Class Discovery Class Prediction Example Biological states and state modulation Chemical Genomics Toxicogenomics Software Tools

### the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

### A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

### Decision Trees Entropy, Information Gain, Gain Ratio

Changelog: 14 Oct, 30 Oct Decision Trees Entropy, Information Gain, Gain Ratio Lecture 3: Part 2 Outline Entropy Information gain Gain ratio Marina Santini Acknowledgements Slides borrowed and adapted

### SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS

SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it

### Decision Trees (Cont.)

Decision Trees (Cont.) R&N Chapter 18.2,18.3 Side example with discrete (categorical) attributes: Predicting age (3 values: less than 30, 30-45, more than 45 yrs old) from census data. Attributes (split

### Introduction to Machine Learning Midterm Exam Solutions

10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

### Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

### CHAPTER-17. Decision Tree Induction

CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

### Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

### Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

### Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

### Decision Tree Learning

Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

### Growing a Large Tree

STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

### Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

### Feature selection and extraction Spectral domain quality estimation Alternatives

Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic

### Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

### Stats notes Chapter 5 of Data Mining From Witten and Frank

Stats notes Chapter 5 of Data Mining From Witten and Frank 5 Credibility: Evaluating what s been learned Issues: training, testing, tuning Predicting performance: confidence limits Holdout, cross-validation,

### Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

### Learning Bayesian Networks Does Not Have to Be NP-Hard

Learning Bayesian Networks Does Not Have to Be NP-Hard Norbert Dojer Institute of Informatics, Warsaw University, Banacha, 0-097 Warszawa, Poland dojer@mimuw.edu.pl Abstract. We propose an algorithm for

### Decision Tree Learning

Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

### Generative v. Discriminative classifiers Intuition

Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

### Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

### C4.5 - pruning decision trees

C4.5 - pruning decision trees Quiz 1 Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can have? A: No. Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can

### Optimal transfer function neural networks

Optimal transfer function neural networks Norbert Jankowski and Włodzisław Duch Department of Computer ethods Nicholas Copernicus University ul. Grudziądzka, 87 Toru ń, Poland, e-mail:{norbert,duch}@phys.uni.torun.pl

### Ensemble determination using the TOPSIS decision support system in multi-objective evolutionary neural network classifiers

Ensemble determination using the TOPSIS decision support system in multi-obective evolutionary neural network classifiers M. Cruz-Ramírez, J.C. Fernández, J. Sánchez-Monedero, F. Fernández-Navarro, C.

### IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

9/3/3 Admin Assignment 3: - how did it go? - do the experiments help? Assignment 4 IMBALANCED DATA Course feedback David Kauchak CS 45 Fall 3 Phishing 9/3/3 Setup Imbalanced data. for hour, google collects

### CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

### Learning Decision Trees

Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

### Graph Theoretic Latent Class Discovery

Graph Theoretic Latent Class Discovery Jeff Solka jsolka@gmu.edu NSWCDD/GMU GMU BINF Colloquium 2/24/04 p.1/28 Agenda What is latent class discovery? What are some approaches to the latent class discovery

if a>10 and b

### Classification and Regression Trees

Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

### Biochip informatics-(i)

Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

### Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

### Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

### 10-701/ Machine Learning, Fall

0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

### Kernel expansions with unlabeled examples

Kernel expansions with unlabeled examples Martin Szummer MIT AI Lab & CBCL Cambridge, MA szummer@ai.mit.edu Tommi Jaakkola MIT AI Lab Cambridge, MA tommi@ai.mit.edu Abstract Modern classification applications

### THE presence of missing values in a dataset often makes

1 Efficient EM Training of Gaussian Mixtures with Missing Data Olivier Delalleau, Aaron Courville, and Yoshua Bengio arxiv:1209.0521v1 [cs.lg] 4 Sep 2012 Abstract In data-mining applications, we are frequently

### Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### L11: Pattern recognition principles

L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

### Induction on Decision Trees

Séance «IDT» de l'ue «apprentissage automatique» Bruno Bouzy bruno.bouzy@parisdescartes.fr www.mi.parisdescartes.fr/~bouzy Outline Induction task ID3 Entropy (disorder) minimization Noise Unknown attribute

### Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:

### Final Exam, Machine Learning, Spring 2009

Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

### Learning from Observations. Chapter 18, Sections 1 3 1

Learning from Observations Chapter 18, Sections 1 3 Chapter 18, Sections 1 3 1 Outline Learning agents Inductive learning Decision tree learning Measuring learning performance Chapter 18, Sections 1 3

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### Unsupervised Classifiers, Mutual Information and 'Phantom Targets'

Unsupervised Classifiers, Mutual Information and 'Phantom Targets' John s. Bridle David J.e. MacKay Anthony J.R. Heading California Institute of Technology 139-74 Defence Research Agency Pasadena CA 91125

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

### Multimodal Deep Learning for Predicting Survival from Breast Cancer

Multimodal Deep Learning for Predicting Survival from Breast Cancer Heather Couture Deep Learning Journal Club Nov. 16, 2016 Outline Background on tumor histology & genetic data Background on survival

### Using Impurity and Depth for Decision Trees Pruning

Using Impurity and epth for ecision Trees Pruning. Fournier and B. Crémilleux GRYC, CNRS - UPRSA 6072 Université de Caen F-14032 Caen Cédex France fournier,cremilleux @info.unicaen.fr Abstract Most pruning

### Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

### CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

### WALD LECTURE II LOOKING INSIDE THE BLACK BOX. Leo Breiman UCB Statistics

1 WALD LECTURE II LOOKING INSIDE THE BLACK BOX Leo Breiman UCB Statistics leo@stat.berkeley.edu ORIGIN OF BLACK BOXES 2 Statistics uses data to explore problems. Think of the data as being generated by

### Classification 1: Linear regression of indicators, linear discriminant analysis

Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36-462/36-662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification

### Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

### Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

### Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

### Decision Trees. Danushka Bollegala

Decision Trees Danushka Bollegala Rule-based Classifiers In rule-based learning, the idea is to learn a rule from train data in the form IF X THEN Y (or a combination of nested conditions) that explains

### Decision T ree Tree Algorithm Week 4 1

Decision Tree Algorithm Week 4 1 Team Homework Assignment #5 Read pp. 105 117 of the text book. Do Examples 3.1, 3.2, 3.3 and Exercise 3.4 (a). Prepare for the results of the homework assignment. Due date

### Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from Predictive Analytics and Futurism July 2016 Issue 13 Regression and Classification: A Deeper Look By Jeff Heaton Classification and regression are the two most common forms of models fitted

### Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

### GROUND RESISTANCE ESTIMATION USING INDUCTIVE MACHINE LEARNING

The 19 th International Symposium on High Voltage Engineering, Pilsen, Czech Republic, August, 23 28, 215 GROUND RESISTANCE ESTIMATION USING INDUCTIVE MACHINE LEARNING V. P. Androvitsaneas *1, I. F. Gonos

### Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

### Intelligence and statistics for rapid and robust earthquake detection, association and location

Intelligence and statistics for rapid and robust earthquake detection, association and location Anthony Lomax ALomax Scientific, Mouans-Sartoux, France anthony@alomax.net www.alomax.net @ALomaxNet Alberto

### Machine Learning Lecture 2

Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

### Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

### Development of a Data Mining Methodology using Robust Design

Development of a Data Mining Methodology using Robust Design Sangmun Shin, Myeonggil Choi, Youngsun Choi, Guo Yi Department of System Management Engineering, Inje University Gimhae, Kyung-Nam 61-749 South

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### SPARSE HIDDEN MARKOV MODELS FOR PURER CLUSTERS

SPARSE HIDDEN MARKOV MODELS FOR PURER CLUSTERS Sujeeth Bharadwaj, Mark Hasegawa-Johnson, Jitendra Ajmera, Om Deshmukh, Ashish Verma Department of Electrical Engineering, University of Illinois, Urbana,

### Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen

Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi

### Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes

Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If

### SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

### Maximum Mean Discrepancy

Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia

### Machine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler

+ Machine Learning and Data Mining Decision Trees Prof. Alexander Ihler Decision trees Func-onal form f(x;µ): nested if-then-else statements Discrete features: fully expressive (any func-on) Structure:

### Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps

Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps Lucas Parra, Gustavo Deco, Stefan Miesbach Siemens AG, Corporate Research and Development, ZFE ST SN 4 Otto-Hahn-Ring

### Introduction to Information Theory and Its Applications

Introduction to Information Theory and Its Applications Radim Bělohlávek Information Theory: What and Why information: one of key terms in our society: popular keywords such as information/knowledge society

### ECE521 Lecture7. Logistic Regression

ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

### A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

### Data Preprocessing. Data Preprocessing

Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

### Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

### 16.4 Multiattribute Utility Functions

285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

### Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

### Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

### From inductive inference to machine learning

From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

### Minimal Attribute Space Bias for Attribute Reduction

Minimal Attribute Space Bias for Attribute Reduction Fan Min, Xianghui Du, Hang Qiu, and Qihe Liu School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu

### Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition

Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition J. Uglov, V. Schetinin, C. Maple Computing and Information System Department, University of Bedfordshire, Luton,

### Single gene analysis of differential expression. Giorgio Valentini

Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples

### Complexity Bounds of Radial Basis Functions and Multi-Objective Learning

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning Illya Kokshenev and Antônio P. Braga Universidade Federal de Minas Gerais - Depto. Engenharia Eletrônica Av. Antônio Carlos, 6.67

### P leiades: Subspace Clustering and Evaluation

P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de

### Analysis of Fast Input Selection: Application in Time Series Prediction

Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,