MDL-Based Unsupervised Attribute Ranking

MDL-Based Unsupervised Attribute Ranking. Zdravko Markov, Computer Science Department, Central Connecticut State University, New Britain, CT 06050, USA. http://www.cs.ccsu.edu/~markov/ markovz@ccsu.edu

MDL-Based Unsupervised Attribute Ranking: Introduction (Attribute Selection), MDL-based Clustering Model Evaluation, Illustrative Example (the "play tennis" data), Attribute Ranking Algorithm, Hierarchical Clustering Algorithm, Experimental Evaluation, Conclusion.

Attribute Selection (Supervised / Unsupervised). Find the smallest set of attributes that maximizes predictive accuracy, or that best uncovers interesting natural groupings (clusters) in the data, according to the chosen criterion. Two approaches: Subset Selection and Ranking (Weighting). Subset selection is computationally expensive (2^m attribute sets for m attributes); ranking assumes that attributes are independent.

Supervised Attribute Selection. Wrapper methods create prediction models and use the predictive accuracy of these models to measure the attribute relevance to the classification task. Filter methods directly measure the ability of the attributes to determine the class labels, using statistical correlation, information metrics, probabilistic or other methods. Numerous methods exist in this setting, due to the wide availability of model evaluation criteria in supervised learning.

Unsupervised Attribute Selection. Wrapper methods evaluate a subset of attributes by the quality of the clustering obtained using these attributes. Filter methods explore classical statistical methods for dimensionality reduction, like PCA and maximum variance, as well as information-based or entropy measures. Very few methods exist in this setting, largely because of the difficulty of evaluating clustering models.

Clustering Model Evaluation. See Chapter 4, "Evaluating Clustering - MDL-Based Model and Feature Evaluation": http://www.cs.ccsu.edu/~markov/ http://www.cs.ccsu.edu/~markov/dmw4.pdf http://www.cs.ccsu.edu/~markov/dmwdata.zip http://www.cs.ccsu.edu/~markov/dmwsoftware.zip

Clustering Model Evaluation. Consider each possible clustering as a hypothesis H that describes (explains) the data D in terms of frequent patterns (regularities). Compute the description length of the data L(D), of the hypothesis L(H), and of the data given the hypothesis L(D|H). L(H) and L(D) are the minimum number of bits needed to encode (or communicate) H and D respectively. L(D|H) represents the number of bits needed to encode D if we know H: if we know the pattern of H, there is no need to encode all its occurrences in D; we may encode only the pattern itself and the differences that identify each individual instance in D.

Minimum Description Length (MDL) and Information Compression. The more regularity there is in D, the shorter the description length L(D|H). L(D|H) must be balanced against L(H), because the latter depends on the complexity of the pattern. Thus the best hypothesis should minimize the sum L(H) + L(D|H) (the MDL principle), or maximize L(D) - L(H) - L(D|H) (information compression).

Encoding MDL. Assume hypotheses and data are uniformly distributed, so the probability of occurrence of an item out of n alternatives is 1/n. The minimum code length of the message that a particular item has occurred is -log2(1/n) = log2 n bits. The number of bits needed to encode the choice of k items out of n possible items is log2 C(n, k), where C(n, k) is the binomial coefficient "n choose k".
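
A minimal Python sketch of these two code lengths (the helper names are mine, not from the talk):

```python
from math import comb, log2

def item_code_length(n):
    """Bits needed to identify one of n equally likely items: log2 n."""
    return log2(n)

def subset_code_length(n, k):
    """Bits needed to identify which k of n items were chosen: log2 C(n, k)."""
    return log2(comb(n, k))

print(item_code_length(8))        # 3.0 bits
print(subset_code_length(10, 8))  # ~5.49 bits
```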

Encoding MDL (attribute-value data). Data D; an instance X ∈ D is a set of m attribute values, |X| = m. T = ∪_{X ∈ D} X is the set of all attribute values occurring in D, with k = |T|. A cluster C_i is defined by the set T_i ⊆ T of all attribute values that occur in its members: C_i = {X | X ∈ C_i, X ⊆ T_i}. A clustering H = {C_1, C_2, ..., C_n} is thus defined by {T_1, T_2, ..., T_n}, with k_i = |T_i|. Then:

L(C_i) = log2 C(k, k_i) + log2 n
L(D|C_i) = |C_i| * log2 C(k_i, m)
MDL(C_i) = log2 C(k, k_i) + log2 n + |C_i| * log2 C(k_i, m)
L(H) = Σ_{i=1..n} L(C_i),  L(D|H) = Σ_{i=1..n} L(D|C_i),  MDL(H) = Σ_{i=1..n} MDL(C_i)
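
The following Python sketch implements this reading of the formulas (function and variable names are mine; the talk points to a Java implementation at http://www.cs.ccsu.edu/~markov/dmwsoftware.zip). Each instance is represented as a set of attribute=value items:

```python
from math import comb, log2

def mdl_cluster(cluster, k, n, m):
    """MDL(C_i) = log2 C(k, k_i) + log2 n + |C_i| * log2 C(k_i, m).

    cluster -- list of instances, each a set of m attribute=value items
    k       -- number of distinct attribute values in the whole data set
    n       -- number of clusters in the clustering
    m       -- number of attributes per instance
    """
    T_i = set().union(*cluster)                       # attribute values occurring in the cluster
    k_i = len(T_i)
    L_C = log2(comb(k, k_i)) + log2(n)                # L(C_i)
    L_D_given_C = len(cluster) * log2(comb(k_i, m))   # L(D|C_i): encode each member of the cluster
    return L_C + L_D_given_C

def mdl_clustering(clusters, k, m):
    """MDL(H) = sum of MDL(C_i) over all clusters."""
    n = len(clusters)
    return sum(mdl_cluster(c, k, n, m) for c in clusters)
```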

Play Tennis Data

ID  outlook   temp  humidity  windy  play
1   sunny     hot   high      false  no
2   sunny     hot   high      true   no
3   overcast  hot   high      false  yes
4   rainy     mild  high      false  yes
5   rainy     cool  normal    false  yes
6   rainy     cool  normal    true   no
7   overcast  cool  normal    true   yes
8   sunny     mild  high      false  no
9   sunny     cool  normal    false  yes
10  rainy     mild  normal    false  yes
11  sunny     mild  normal    true   yes
12  overcast  mild  high      true   yes
13  overcast  hot   normal    false  yes
14  rainy     mild  high      true   no

C_1 = {1, 2, 3, 4, 8, 12, 14} (humidity=high), C_2 = {5, 6, 7, 9, 10, 11, 13} (humidity=normal).
T_1 = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, humidity=high, windy=false, windy=true}
T_2 = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, temp=cool, humidity=normal, windy=false, windy=true}

Clustering Play Tennis Data

MDL(C_i) = log2 C(k, k_i) + log2 n + |C_i| * log2 C(k_i, m)

k_1 = |T_1| = 8, k_2 = |T_2| = 9, k = 10, m = 4, n = 2

MDL(C_1) = log2 C(10, 8) + log2 2 + 7 * log2 C(8, 4) = 49.39
MDL(C_2) = log2 C(10, 9) + log2 2 + 7 * log2 C(9, 4) = 53.16
MDL({C_1, C_2}) = MDL(humidity) = 102.55 bits

Ranking all four attributes this way gives: 1. MDL(temp) = 101.87, 2. MDL(humidity) = 102.56, 3. MDL(outlook) = 103.46, 4. MDL(windy) = 106.33. The best attribute is temp.
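
As a check, the sketch above reproduces these numbers on the play tennis data (reusing mdl_cluster and mdl_clustering from the previous block):

```python
rows = [  # (outlook, temp, humidity, windy) for instances 1-14
    ("sunny", "hot", "high", "false"),    ("sunny", "hot", "high", "true"),
    ("overcast", "hot", "high", "false"), ("rainy", "mild", "high", "false"),
    ("rainy", "cool", "normal", "false"), ("rainy", "cool", "normal", "true"),
    ("overcast", "cool", "normal", "true"), ("sunny", "mild", "high", "false"),
    ("sunny", "cool", "normal", "false"), ("rainy", "mild", "normal", "false"),
    ("sunny", "mild", "normal", "true"),  ("overcast", "mild", "high", "true"),
    ("overcast", "hot", "normal", "false"), ("rainy", "mild", "high", "true"),
]
attrs = ["outlook", "temp", "humidity", "windy"]
D = [{f"{a}={v}" for a, v in zip(attrs, r)} for r in rows]
k = len(set().union(*D))                        # 10 distinct attribute values
m = len(attrs)                                  # 4 attributes per instance

C1 = [x for x in D if "humidity=high" in x]     # instances 1,2,3,4,8,12,14
C2 = [x for x in D if "humidity=normal" in x]   # instances 5,6,7,9,10,11,13
print(round(mdl_cluster(C1, k, 2, m), 2))       # 49.4  (the slide shows 49.39)
print(round(mdl_cluster(C2, k, 2, m), 2))       # 53.16
print(round(mdl_clustering([C1, C2], k, m), 2)) # 102.56 (the slide rounds to 102.55)
```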

MDL Ranker

Let A_i have values v_1, v_2, ..., v_p; the induced clustering is {C_1, C_2, ..., C_p}, where C_j = {X | v_j ∈ X}. Let V_{A_i} = {}. For each data instance X = {x_1, x_2, ..., x_m}, for each attribute A_i, for each value x_j: V_{A_i} = V_{A_i} ∪ {x_j}; k_i = Σ_{j=1..m} |V_{A_j}|. Compute MDL({C_1, C_2, ..., C_p}).

The computation is incremental (no need to store the instances). Time: O(nm^2), where n is the number of data instances. Space: O(pm^2), where p is the maximum number of attribute values. Evaluates 3204 instances with 395 attributes (the trec data) in 3 minutes.
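
A non-incremental Python sketch of the ranker (the talk's version is incremental and does not store the instances); it reuses D, attrs, k, m and mdl_clustering from the blocks above:

```python
def mdl_rank(D, attrs, k, m):
    """Score each attribute by the MDL of the clustering its values induce; lower is better."""
    scores = {}
    for a in attrs:
        clusters = {}
        for x in D:
            v = next(item for item in x if item.startswith(a + "="))
            clusters.setdefault(v, []).append(x)   # group instances by their value of attribute a
        scores[a] = mdl_clustering(list(clusters.values()), k, m)
    return sorted(scores.items(), key=lambda kv: kv[1])

print(mdl_rank(D, attrs, k, m))
# temp ~ 101.87, humidity ~ 102.56, outlook ~ 103.46, windy ~ 106.33, matching the slide's ranking
```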

Experimental Evaluation: Data

Data Set        Instances  Attributes  Classes
reuters         504        2887        3
reuters-3class  46         2887        3
reuters-2class  927        2887        2
trec            3204       395         6
soybean         683        36          19
soybean-small   47         36          4
iris            150        5           3
ionosphere      351        35          2

Java implementations of MDL ranking and clustering are available from http://www.cs.ccsu.edu/~markov/dmwsoftware.zip

Experimental Evaluation: Metrics

Average precision of an attribute ranking a_1, a_2, ... with respect to a reference attribute set D_q:
AveragePrecision = (1/|D_q|) Σ_k r_k * PrecisionAtRank(k), where PrecisionAtRank(k) = (1/k) Σ_{i=1..k} r_i, and r_i = 1 if a_i ∈ D_q, 0 otherwise.

Classes-to-clusters accuracy (class label = true cluster membership), e.g. on the play tennis data:

root [5, 9]
  temperature=hot [2, 2]
    outlook=sunny [2] no
    outlook=overcast [2] yes
  temperature=mild [4, 2]
    windy=false [2, 1] yes
    windy=true [2, 1] yes
  temperature=cool [3, 1]
    windy=false [2] yes
    windy=true [1, 1] no
-----------------------------------------
Clusters (leaves): 6
Correctly classified instances: 11 (78%)
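
A small Python sketch of the average-precision computation for an attribute ranking (names are mine):

```python
def average_precision(ranking, D_q):
    """AP = (1/|D_q|) * sum of PrecisionAtRank(k) over ranks k where r_k = 1,
    with r_k = 1 if the k-th ranked attribute is in the reference set D_q."""
    hits, ap = 0, 0.0
    for k, attr in enumerate(ranking, start=1):
        if attr in D_q:
            hits += 1
            ap += hits / k          # PrecisionAtRank(k); ranks with r_k = 0 contribute nothing
    return ap / len(D_q)

# Example: reference set {humidity, windy} against the ranking obtained above
print(average_precision(["temp", "humidity", "outlook", "windy"], {"humidity", "windy"}))  # 0.5
```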

Average Precision of Attribute Ranking

Data set        |D_q|  InfoGain  MDL     Error   Entropy
reuters         5      0.383     0.435   0.0642  0.0030
reuters-3class  10     0.3948    0.852   0.257   0.0027
reuters-2class  7      0.506     0.2438  0.788   0.3073
trec            4      0.4890    0.244   0.0637  0.000
soybean         6      0.6265    0.5606  0.387   0.452
soybean-small   2      0.6428    0.3500  0.093   0.23
iris            1      1.0000    1.0000  1.0000  0.3333
ionosphere      9      0.6596    0.504   0.2575  0.4252

D_q: set of attributes selected by the Wrapper Subset Evaluator with a Naïve Bayes classifier. InfoGain: supervised attribute ranking using the Information Gain evaluator. Error: unsupervised ranking based on evaluating the quality of clustering by the sum of squared errors. Entropy: unsupervised ranking based on the reduction of the entropy in the data when the attribute is removed (Dash and Liu 2000).

Classes-To-Clusters Accuracy With Reuters Data

[Two plots: classes-to-clusters accuracy (%) of EM and of k-means versus the number of top-ranked attributes used (from 2 up to all 2887), comparing MDL-ranked and InfoGain-ranked attribute orderings.]

Classes-To-Clusters Accuracy With Reuters-3class Data

[Two plots: classes-to-clusters accuracy (%) of EM and of k-means versus the number of top-ranked attributes used (from 2 up to all 2886), comparing MDL-ranked and InfoGain-ranked attribute orderings.]

Classes-To-Clusters Accuracy With Soybean Data

[Two plots: classes-to-clusters accuracy (%) of EM and of k-means versus the number of top-ranked attributes used (from all 36 down to 2), comparing MDL-ranked and InfoGain-ranked attribute orderings.]

MDL-Based Clustering

Function MDL-Cluster(D):
1. Choose the attribute A = argmin_i MDL(A_i).
2. Let A take values v_1, v_2, ..., v_p.
3. Split the data D = ∪_{i=1..p} C_i, where C_i = {X | v_i ∈ X}.
4. If Comp(A) > Σ_{i=1..p} Comp(C_i), then stop and return D (Comp denotes the information compression, L(D) - L(H) - L(D|H)).
5. For each i = 1, ..., p, call MDL-Cluster(C_i).
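
A Python sketch of the recursion, reusing mdl_rank and mdl_clustering from the earlier blocks. The stopping test is my reading of the slide: Comp() stands for information compression, taken here as L(D) - MDL with L(D) = |D| * log2 C(k, m), i.e. the cost of encoding the instances without any clustering. This particular form of Comp() is an assumption, not taken from the talk.

```python
from math import comb, log2

def comp(cluster, k, m):
    """Information compression of a set of instances treated as a single cluster
    (assumed form: bits saved relative to encoding each instance out of all k values)."""
    return len(cluster) * log2(comb(k, m)) - mdl_clustering([cluster], k, m)

def mdl_cluster_tree(D, attrs, k, m):
    best_attr, _ = mdl_rank(D, attrs, k, m)[0]          # 1. A = argmin_i MDL(A_i)
    children = {}                                       # 2-3. split D on the values of A
    for x in D:
        v = next(item for item in x if item.startswith(best_attr + "="))
        children.setdefault(v, []).append(x)
    # 4. stop if splitting does not improve compression (assumed reading of the slide)
    if len(children) < 2 or comp(D, k, m) > sum(comp(c, k, m) for c in children.values()):
        return D                                        # leaf: return the instances
    # 5. recurse on each child cluster
    return {v: mdl_cluster_tree(c, attrs, k, m) for v, c in children.items()}

tree = mdl_cluster_tree(D, attrs, k, m)
```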

Clustering Reuters-2class Data

root (56550.58) [608, 319]
  trade=0 (434956.39) [507, 8]
    rate=0 (26626.68) [339, 8]
      mone=1 (2254.60) [48] money
      mone=0 (6236.34) [9, 8] money
    rate=1 (25589.70) [68]
      currenc=0 (70870.68) [00] money
      currenc=1 (5049.67) [68] money
  trade=1 (204850.37) [30, 0]
    market=0 (57978.80) [86, 39]
      countr=1 (6448.90) [67, 20] trade
      countr=0 (06457.20) [9, 9] trade
    market=1 (06422.43) [5, 62]
      bank=0 (73572.74) [94, ] trade
      bank=1 (48489.70) [2, 5] money
-----------------------------------------
Clusters (leaves): 8
Correctly classified instances: 838 (90%)

MDL-Cluster Tree:
root (56550.58) [608, 319]
  trade=0 (434956.39) [507, 8] money
  trade=1 (204850.37) [30, 0]
    market=0 (57978.80) [86, 39]
      countr=1 (6448.90) [67, 20] trade
      countr=0 (06457.20) [9, 9] trade
    market=1 (06422.43) [5, 62]
      bank=0 (73572.74) [94, ] trade
      bank=1 (48489.70) [2, 5] money
-----------------------------------------
Clusters (leaves): 5
Correctly classified instances: 838 (90%)

Comparing MDL, EM and k-means

                 EM                       k-means                  MDL-Cluster
Data set         Acc. %  No. of Clusters  Acc. %  No. of Clusters  Acc. %  No. of Clusters
reuters          43      6                3       3                59      2
reuters-3class   58      3                48      3                73      7
reuters-2class   7       2                6       2                90      7
trec             26      6                29      6                44
soybean          60      9                5       9                5       7
soybean-small    100     4                9       4                83      4
iris             95      3                69      3                96      3
ionosphere       89      2                8       2                80      3

Conclusion. The MDL ranker, which uses no class information, performs close to the InfoGain method, which essentially uses class information. Thus our approach can improve the performance of clustering algorithms in a purely unsupervised setting. MDL-Cluster outperforms EM and k-means on most benchmark data sets. Open questions: numeric attributes? subset evaluation? non-hierarchical clustering? Thank you!