Report on Preliminary Experiments with Data Grid Models in the Agnostic Learning vs. Prior Knowledge Challenge


Marc Boullé, France Telecom Research & Development
IJCNN 2007, 08/2007

Outline
- From univariate to multivariate supervised data grids
- Evaluation on the challenge
- Conclusion

Numerical variables: univariate analysis using supervised discretization
Context: supervised classification, data preparation.
Discretization: slice a numerical domain into intervals.
Main issues:
- informational quality: good fit of the data
- statistical quality: good generalization
[Figure: histogram of instance counts per Sepal width value on the Iris dataset, by class (Versicolor, Virginica, Setosa)]

Supervised discretization: conditional density estimation
Discretization of Sepal width (Iris dataset) into three intervals, with class counts per interval:

Sepal width   ]-inf; 2.95[   [2.95; 3.35[   [3.35; +inf[
Versicolor    34             15             1
Virginica     21             24             5
Setosa        2              18             30

[Figures: instance counts per Sepal width value and per interval, by class]
Which model is the best one?
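The counts above directly yield an empirical estimate of the class distribution in each interval. A minimal sketch, using only the table above (interval labels and variable names are illustrative):

```python
# Empirical estimate of P(class | interval) from the Iris "Sepal width"
# discretization above.
counts = {
    "]-inf;2.95[": {"Versicolor": 34, "Virginica": 21, "Setosa": 2},
    "[2.95;3.35[": {"Versicolor": 15, "Virginica": 24, "Setosa": 18},
    "[3.35;+inf[": {"Versicolor": 1,  "Virginica": 5,  "Setosa": 30},
}

for interval, class_counts in counts.items():
    total = sum(class_counts.values())
    probs = {c: round(n / total, 2) for c, n in class_counts.items()}
    print(interval, probs)
```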

MODL approach: principles
Definition of the model space:
- parameters (I, N_{i.}, N_{ij}): number of intervals, interval frequencies, class frequencies per interval
Definition of a prior distribution on the model parameters:
- hierarchical prior, uniform at each stage of the hierarchy of the parameters
- independence assumption of the class distribution between the intervals
Exact analytical criterion to evaluate the models:

  \log N + \log \binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{N_{i.}+J-1}{J-1}   (prior)
  + \sum_{i=1}^{I} \log \frac{N_{i.}!}{N_{i1}! \, N_{i2}! \cdots N_{iJ}!}   (likelihood)

Efficient optimization: O(N log N)
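As a sanity check, here is a minimal sketch that evaluates this criterion for a given discretization, expressed as per-interval class counts (natural logarithms assumed; function names are illustrative, not from the paper):

```python
from math import comb, lgamma, log

def log_factorial(n):
    # log(n!) via lgamma, robust for large counts
    return lgamma(n + 1)

def modl_discretization_cost(cell_counts):
    # cell_counts: one list of class counts per interval, e.g. the Iris
    # table above is [[34, 21, 2], [15, 24, 18], [1, 5, 30]].
    I = len(cell_counts)                      # number of intervals
    J = len(cell_counts[0])                   # number of classes
    N = sum(sum(row) for row in cell_counts)  # number of instances

    # Prior: number of intervals, interval frequencies, class frequencies per interval.
    cost = log(N) + log(comb(N + I - 1, I - 1))
    for row in cell_counts:
        cost += log(comb(sum(row) + J - 1, J - 1))

    # Likelihood: one multinomial term per interval.
    for row in cell_counts:
        cost += log_factorial(sum(row)) - sum(log_factorial(n) for n in row)
    return cost

# Lower cost means a better model: compare the 3-interval Iris
# discretization with the trivial 1-interval model.
print(modl_discretization_cost([[34, 21, 2], [15, 24, 18], [1, 5, 30]]))
print(modl_discretization_cost([[50, 50, 50]]))
```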

Categorical variables: univariate analysis using value grouping
Example: mushroom Cap color vs. edibility

Cap color   EDIBLE    POISONOUS   Frequency
BROWN       55.2%     44.8%       1610
GRAY        61.2%     38.8%       1458
RED         40.2%     59.8%       1066
YELLOW      38.4%     61.6%       743
WHITE       69.9%     30.1%       711
BUFF        30.3%     69.7%       122
PINK        39.6%     60.4%       101
CINNAMON    71.0%     29.0%       31
GREEN       100.0%    0.0%        13
PURPLE      100.0%    0.0%        10

After value grouping:

Group     Values                      EDIBLE    POISONOUS   Frequency
G_RED     RED, YELLOW, BUFF, PINK     38.9%     61.1%       2032
G_BROWN   BROWN                       55.2%     44.8%       1610
G_GRAY    GRAY                        61.2%     38.8%       1458
G_WHITE   WHITE, CINNAMON             69.9%     30.1%       742
G_GREEN   GREEN, PURPLE               100.0%    0.0%        23

Which model is the best one?

Supervised value grouping
Same approach as for discretization. A value grouping model is defined by:
- the number of groups of values,
- the partition of the input values into groups,
- the distribution of the target values for each group.
Model selection:
- Bayesian approach for model selection
- hierarchical prior for the model parameters
- exact analytical criterion to evaluate the models (V: number of input values, B(V, I): number of partitions of the V values into at most I groups):

  \log V + \log B(V, I) + \sum_{i=1}^{I} \log \binom{N_{i.}+J-1}{J-1}   (prior)
  + \sum_{i=1}^{I} \log \frac{N_{i.}!}{N_{i1}! \, N_{i2}! \cdots N_{iJ}!}   (likelihood)

Efficient optimization: O(N log N)
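A minimal sketch of this grouping criterion, mirroring the discretization sketch above; B(V, I) is computed as a sum of Stirling numbers of the second kind, which is an assumption about its exact definition, and all names are illustrative:

```python
from functools import lru_cache
from math import comb, lgamma, log

@lru_cache(maxsize=None)
def stirling2(n, k):
    # Stirling number of the second kind: partitions of n items into k non-empty groups.
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def B(V, I):
    # Assumed here: number of partitions of V values into at most I groups.
    return sum(stirling2(V, k) for k in range(1, I + 1))

def modl_grouping_cost(V, group_counts):
    # V: number of distinct input values (10 cap colors in the example above).
    # group_counts: one list of class counts per group of values.
    I = len(group_counts)
    J = len(group_counts[0])
    cost = log(V) + log(B(V, I))
    for row in group_counts:
        n_i = sum(row)
        cost += log(comb(n_i + J - 1, J - 1))                       # prior
        cost += lgamma(n_i + 1) - sum(lgamma(n + 1) for n in row)   # likelihood
    return cost
```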

Pair of numerical variables
Bivariate discretization to evaluate the conditional density (Wine dataset, variables V1 and V7, classes 1, 2, 3):
- each variable is discretized
- we get a bivariate data grid
- in each cell of the data grid, estimation of the conditional density

Data grid V7 x V1, cell counts (Class 1, Class 2, Class 3):

V7 \ V1         ]-inf; 12.78]   ]12.78; +inf[
]2.18; +inf[    (0, 23, 0)      (59, 0, 4)
]1.235; 2.18]   (0, 35, 0)      (0, 5, 6)
]-inf; 1.235]   (0, 4, 11)      (0, 0, 31)

[Figure: scatter plot of V7 vs. V1 with the three classes]

Application of the MODL approach
Definition of the model space:
- parameters I_1, I_2, N_{i_1..}, N_{.i_2.}, N_{i_1 i_2 j}: interval counts and frequencies of each variable, class frequencies per cell
Definition of a prior distribution on the model parameters:
- hierarchical prior, uniform at each stage of the hierarchy of the parameters
- independence assumption of the class distribution between the cells
Exact analytical criterion to evaluate the models:

  \log N + \log \binom{N+I_1-1}{I_1-1} + \log N + \log \binom{N+I_2-1}{I_2-1}
  + \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \log \binom{N_{i_1 i_2 .}+J-1}{J-1}   (prior)
  + \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \log \frac{N_{i_1 i_2 .}!}{N_{i_1 i_2 1}! \, N_{i_1 i_2 2}! \cdots N_{i_1 i_2 J}!}   (likelihood)

Optimization algorithms
Main algorithm: greedy top-down merge heuristic.
Post-optimization by alternating univariate optimizations:
- freeze the partition of X_1 and optimize the partition of variable X_2
- freeze the partition of X_2 and optimize the partition of variable X_1
Global search: Variable Neighborhood Search (VNS) meta-heuristic.
Algorithmic complexity: O(N log N), by
- exploiting the sparseness of data grids: at most N non-empty cells among N^2 potential cells
- exploiting the additivity of the evaluation criterion
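An illustrative sketch of the alternating post-optimization loop described above (not the author's implementation; `optimize_partition` and `criterion` are hypothetical callables standing in for the MODL machinery):

```python
def alternate_post_optimization(grid, optimize_partition, criterion, max_rounds=10):
    # grid: current bivariate data grid (partitions of X1 and X2).
    # optimize_partition(grid, variable): re-optimizes one variable's partition
    # while the other partition stays frozen.
    # criterion(grid): MODL cost of the grid (lower is better).
    best = criterion(grid)
    for _ in range(max_rounds):
        grid = optimize_partition(grid, variable="X2")  # X1 frozen
        grid = optimize_partition(grid, variable="X1")  # X2 frozen
        value = criterion(grid)
        if value >= best:   # no further improvement: stop
            break
        best = value
    return grid, best
```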

Illustration: noise or information?
[Figures: three two-dimensional point diagrams, Diagram 1, Diagram 2 and Diagram 3, analyzed on the next slides]

Diagram 1: noise
- Diagram 1 (2000 points)
- Data grid 1 x 1 (1 cell)
- Criterion value = 2075

Diagram 2: chessboard 8 x 8 with 25% noise
- Diagram 2 (2000 points)
- Data grid 8 x 8 (64 cells)
- Criterion value = 1900

Diagram 3: chessboard 32 x 32, no noise
- Diagram 3 (2000 points)
- Data grid 32 x 32 (1024 cells)
- Criterion value = 1928

Multivariate data grid models for conditional density estimation
Data grid model:
- variable selection (k in K_S)
- discretization of numerical variables (k in K_1)
- value grouping of categorical variables (k in K_2)
- for each cell of the resulting data grid, definition of the distribution of the output values
Model selection:
- Bayesian approach
- hierarchical prior on the parameters
- exact analytical criterion
- advanced optimization heuristics, with complexity O(K N log N max(K, log N)) (K: number of variables, N: number of instances)

Evaluation criterion:

  \log(K+1) + \log \binom{K+K_S-1}{K_S}
  + \sum_{k \in K_S \cap K_1} \left( \log N + \log \binom{N+I_k-1}{I_k-1} \right)
  + \sum_{k \in K_S \cap K_2} \left( \log V_k + \log B(V_k, I_k) \right)
  + \sum_{i_1=1}^{I_1} \cdots \sum_{i_K=1}^{I_K} \log \binom{N_{i_1 i_2 \ldots i_K .}+J-1}{J-1}   (prior)
  + \sum_{i_1=1}^{I_1} \cdots \sum_{i_K=1}^{I_K} \log \frac{N_{i_1 i_2 \ldots i_K .}!}{N_{i_1 i_2 \ldots i_K 1}! \, N_{i_1 i_2 \ldots i_K 2}! \cdots N_{i_1 i_2 \ldots i_K J}!}   (likelihood)
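A compact sketch of this criterion for the case where all selected variables are numerical (so the value grouping term drops out). The dict-of-non-empty-cells representation and the variable selection term follow the reconstruction above and should be treated as assumptions:

```python
from math import comb, lgamma, log

def multivariate_grid_cost(K, n_intervals, cell_counts):
    # K: total number of input variables.
    # n_intervals: number of intervals I_k for each selected (numerical) variable.
    # cell_counts: {cell index tuple: per-class counts}, non-empty cells only
    # (empty cells contribute zero to both cell terms below).
    Ks = len(n_intervals)
    N = sum(sum(c) for c in cell_counts.values())
    J = len(next(iter(cell_counts.values())))

    # Prior on variable selection (reconstructed term, see above).
    cost = log(K + 1) + log(comb(K + Ks - 1, Ks))
    # Prior on the discretization of each selected variable.
    for I_k in n_intervals:
        cost += log(N) + log(comb(N + I_k - 1, I_k - 1))
    # Prior and likelihood of the class distribution per cell.
    for counts in cell_counts.values():
        n_cell = sum(counts)
        cost += log(comb(n_cell + J - 1, J - 1))
        cost += lgamma(n_cell + 1) - sum(lgamma(n + 1) for n in counts)
    return cost
```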

Outline
- From univariate to multivariate supervised data grids
- Evaluation on the challenge
- Conclusion

Building classifiers from data grids
Data grid classifier (DG):
- train the MAP data grid on all the available data (no validation set)
- for a test instance: retrieve the related train cell and predict the empirical target distribution of that cell
Data grid ensemble classifier (DGE):
- collect the train data grids encountered during optimization
- compression-based averaging: weight according to the posterior probability (with a logarithmic smoothing)
Coclustering of instances and variables (DGCC): see paper
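A minimal sketch of the DG prediction rule for numerical variables only; the interval-bound representation, the fallback for cells that were empty at train time, and all names are assumptions for illustration:

```python
import bisect

def cell_index(instance, bounds_per_variable):
    # instance: values of the selected variables, in a fixed order.
    # bounds_per_variable: sorted interval upper bounds for each variable;
    # bisect_left places a value equal to a bound in the left (right-closed) interval.
    return tuple(bisect.bisect_left(bounds, x)
                 for x, bounds in zip(instance, bounds_per_variable))

def predict_distribution(instance, bounds_per_variable, cell_class_counts, n_classes):
    counts = cell_class_counts.get(cell_index(instance, bounds_per_variable))
    if not counts:
        # Cell empty at train time: simple uniform fallback (illustrative choice).
        return [1.0 / n_classes] * n_classes
    total = sum(counts)
    return [c / total for c in counts]
```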

Challenge results

Dataset   My best entry              Entry ID   Test BER   Test AUC   Score    Track   Best test BER (agnostic)   Best test BER (prior)
ADA       Data Grid (CMA)            920        0.1756     0.8464     0.0245   Prior   0.166                      0.1756
GINA      Data Grid (Coclustering)   921        0.0516     0.9768     0.3718   Prior   0.0339                     0.0226
HIVA      Data Grid (Coclustering)   921        0.3127     0.7077     0.5904   Agnos   0.2827                     0.2693
NOVA      Data Grid (Coclustering)   921        0.0488     0.9813     0.141    Agnos   0.0456                     0.0659
SYLVA     Data Grid (CMA)            918        0.0158     0.9873     0.6482   Agnos   0.0062                     0.0043
Overall   Data Grid (Coclustering)   921        0.1223     0.8984     0.3813   Prior   0.1117                     0.1095

Analysis of the results
Predictive performance:
- data grid classifiers perform well on dense datasets
- building ensembles of classifiers is better
- coclustering looks appealing for sparse datasets
Interpretation: few variables are selected, for example in the prior track:
- Ada: 6 variables (Test BER = 0.2058)
- Gina: 7 variables (Test BER = 0.1721)
- Sylva: 4 variables (Test BER = 0.0099)

MAP data grid for the Ada dataset (prior track): six selected variables

occupation: {Prof-specialty, Exec-managerial, Sales, Adm-clerical, Tech-support, Protective-serv, Armed-Forces}, {Craft-repair, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, Farming-fishing, Priv-house-serv}
relationship: {Not-in-family, Own-child, Unmarried, Other-relative}, {Husband, Wife}
educationnum: ]-inf; 12.5], ]12.5; +inf[
age: ]-inf; 27.5], ]27.5; +inf[
capitalgain: ]-inf; 4668.5], ]4668.5; 5095.5], ]5095.5; +inf[
capitalloss: ]-inf; 1805], ]1805; 2001.5], ]2001.5; +inf[

Adult data grid cells: 12 most frequent cells (90% of the instances)

relationship          occupation             educationnum   age           capitalgain      capitalloss   Class -1   Class 1   Frequency
{Husband,...}         {Craft-repair,...}     ]-inf;12.5]    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   78%        22%       736
{Not-in-family,...}   {Craft-repair,...}     ]-inf;12.5]    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   97%        3%        577
{Not-in-family,...}   {Prof-specialty,...}   ]-inf;12.5]    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   94%        6%        531
{Husband,...}         {Prof-specialty,...}   ]-inf;12.5]    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   59%        41%       489
{Husband,...}         {Prof-specialty,...}   ]12.5;+inf[    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   31%        69%       445
{Not-in-family,...}   {Craft-repair,...}     ]-inf;12.5]    ]-inf;27.5]   ]-inf;4668.5]    ]-inf;1805]   100%       0%        425
{Not-in-family,...}   {Prof-specialty,...}   ]-inf;12.5]    ]-inf;27.5]   ]-inf;4668.5]    ]-inf;1805]   99%        1%        316
{Not-in-family,...}   {Prof-specialty,...}   ]12.5;+inf[    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   79%        21%       268
{Not-in-family,...}   {Prof-specialty,...}   ]12.5;+inf[    ]-inf;27.5]   ]-inf;4668.5]    ]-inf;1805]   99%        1%        112
{Husband,...}         {Craft-repair,...}     ]-inf;12.5]    ]-inf;27.5]   ]-inf;4668.5]    ]-inf;1805]   95%        5%        96
{Husband,...}         {Prof-specialty,...}   ]12.5;+inf[    ]27.5;+inf[   ]5095.5;+inf[    ]-inf;1805]   0%         100%      93
{Husband,...}         {Craft-repair,...}     ]12.5;+inf[    ]27.5;+inf[   ]-inf;4668.5]    ]-inf;1805]   76%        24%       50

Outline
- From univariate to multivariate supervised data grids
- Evaluation on the challenge
- Conclusion

Data grids for supervised learning
Evaluation of the class conditional density P(Y | X_1, X_2, ..., X_K):
- each input variable is partitioned into intervals or groups of values
- the cross-product of the univariate partitions forms a data grid, used to evaluate the correlation with the class variable
MODL approach:
- Bayesian model selection approach
- analytical evaluation criterion
- efficient combinatorial optimization algorithms
- high quality data grids
Experimental evaluation:
- good prediction on dense datasets
- variable selection (limited to log_2(N) variables)
- reliable interpretation
- efficient technique for data preparation

Future work
Explore the potential and limits of data grids:
- for all tasks of machine learning, supervised and unsupervised
- for data preparation, prediction, explanation
Combine data grids with the naive Bayes approach:
- use data grids to build informative multivariate features
- use naive Bayes classification to combine features
- use ensembles of classifiers to further improve the results