MDL-Based Unsupervised Attribute Ranking



1 MDL-Based Unsupervised Attribute Ranking
Zdravko Markov, Computer Science Department, Central Connecticut State University, New Britain, CT 06050, USA

2 MDL-Based Unsupervised Attribute Ranking
Introduction (Attribute Selection)
MDL-based Clustering Model Evaluation
Illustrative Example ("play tennis" data)
Attribute Ranking Algorithm
Hierarchical Clustering Algorithm
Experimental Evaluation
Conclusion

3 Attribute Selection
Supervised / Unsupervised. Find the smallest set of attributes that maximizes predictive accuracy (supervised) or best uncovers interesting natural groupings (clusters) in data according to the chosen criterion (unsupervised).
Subset Selection / Ranking (Weighting). Subset selection is computationally expensive: there are $2^m$ attribute sets for m attributes. Ranking assumes that attributes are independent.

4 Supervised Attribute Selection
Wrapper methods create prediction models and use the predictive accuracy of these models to measure the attribute relevance to the classification task.
Filter methods directly measure the ability of the attributes to determine the class labels using statistical correlation, information metrics, probabilistic or other methods.
There exist numerous methods in this setting due to the wide availability of model evaluation criteria in supervised learning.

5 Unsupervised Attribute Selection
Wrapper methods evaluate a subset of attributes by the quality of clustering obtained by using these attributes.
Filter methods explore classical statistical methods for dimensionality reduction, like PCA and maximum variance, information-based or entropy measures.
There exist very few methods in this setting, generally because of the difficulty of evaluating clustering models.

6 Clustering Model Evaluation
Chapter 4: Evaluating Clustering - MDL-Based Model and Feature Evaluation http://www.cs.ccsu.

7 Clustering Model Evaluation
Consider each possible clustering as a hypothesis H that describes (explains) data D in terms of frequent patterns (regularities).
Compute the description length of the data L(D), of the hypothesis L(H), and of the data given the hypothesis L(D|H).
L(H) and L(D) are the minimum number of bits needed to encode (or communicate) H and D respectively.
L(D|H) represents the number of bits needed to encode D if we know H. If we know the pattern of H, there is no need to encode all its occurrences in D; rather, we may encode only the pattern itself and the differences that identify each individual instance in D.

8 Minimum Description Length (MDL) and Information Compression
The more regularity in D, the shorter the description length L(D|H).
We need to balance L(D|H) with L(H), because the latter depends on the complexity of the pattern.
Thus the best hypothesis should minimize the sum L(H) + L(D|H) (MDL principle) or maximize L(D) − L(H) − L(D|H) (Information Compression).

9 Encoding MDL
Hypotheses and data are uniformly distributed, and the probability of occurrence of an item out of n alternatives is 1/n.
The minimum code length of the message that a particular item has occurred is $-\log_2 (1/n) = \log_2 n$ bits.
The number of bits needed to encode the choice of k items out of n possible items is
$$-\log_2 \frac{1}{\binom{n}{k}} = \log_2 \binom{n}{k}$$
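
The two code lengths above can be checked numerically. The following is a minimal Python sketch; the function names `bits_for_item` and `bits_for_subset` are illustrative and do not come from the slides.

```python
from math import comb, log2


def bits_for_item(n: int) -> float:
    """Bits needed to single out one item among n equally likely alternatives."""
    return log2(n)


def bits_for_subset(n: int, k: int) -> float:
    """Bits needed to single out one k-item subset among all C(n, k) subsets."""
    return log2(comb(n, k))


if __name__ == "__main__":
    print(bits_for_item(8))          # one of 8 alternatives
    print(bits_for_subset(10, 8))    # one of C(10, 8) = 45 subsets
```

Both functions assume the uniform distribution stated on the slide; with any other distribution the code lengths would differ.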

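10 Encoding MDL (attribute-value)
Data D; an instance $X \in D$ is a set of m attribute values, $|X| = m$.
$T = \bigcup_{X \in D} X$ is the set of all attribute values in D, $k = |T|$.
Cluster $C_i$ is defined by the set $T_i \subseteq T$ of all attribute values that occur in its members, so every $X \in C_i$ satisfies $X \subseteq T_i$.
Clustering $H = \{C_1, C_2, \ldots, C_n\}$ is defined by $\{T_1, T_2, \ldots, T_n\}$, $k_i = |T_i|$.
$$L(C_i) = \log_2 \binom{k}{k_i} + \log_2 n$$
$$L(D \mid C_i) = |C_i| \log_2 \binom{k_i}{m}$$
$$MDL(C_i) = \log_2 \binom{k}{k_i} + \log_2 n + |C_i| \log_2 \binom{k_i}{m}$$
$$L(H) = \sum_{i=1}^{n} L(C_i), \qquad L(D \mid H) = \sum_{i=1}^{n} L(D \mid C_i), \qquad MDL(H) = \sum_{i=1}^{n} MDL(C_i)$$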

11 Play Tennis Data
ID  outlook   temp  humidity  windy  play
1   sunny     hot   high      false  no
2   sunny     hot   high      true   no
3   overcast  hot   high      false  yes
4   rainy     mild  high      false  yes
5   rainy     cool  normal    false  yes
6   rainy     cool  normal    true   no
7   overcast  cool  normal    true   yes
8   sunny     mild  high      false  no
9   sunny     cool  normal    false  yes
10  rainy     mild  normal    false  yes
11  sunny     mild  normal    true   yes
12  overcast  mild  high      true   yes
13  overcast  hot   normal    false  yes
14  rainy     mild  high      true   no
$C_1 = \{1, 2, 3, 4, 8, 12, 14\}$ (humidity=high)
$C_2 = \{5, 6, 7, 9, 10, 11, 13\}$ (humidity=normal)
$T_1$ = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, humidity=high, windy=false, windy=true}
$T_2$ = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, temp=cool, humidity=normal, windy=false, windy=true}

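12 Clustering Play Tennis Data
$$MDL(C_i) = \log_2 \binom{k}{k_i} + \log_2 n + |C_i| \log_2 \binom{k_i}{m}$$
$k_1 = |T_1| = 8$, $k_2 = |T_2| = 9$, $k = 10$, $m = 4$, $n = 2$
$$MDL(C_1) = \log_2 \binom{10}{8} + \log_2 2 + 7 \log_2 \binom{8}{4} \approx 49.40$$
$$MDL(C_2) = \log_2 \binom{10}{9} + \log_2 2 + 7 \log_2 \binom{9}{4} \approx 53.16$$
$MDL(\{C_1, C_2\}) = MDL(\text{humidity}) \approx 102.56$ bits.
MDL(temp) < MDL(humidity) < MDL(outlook) < MDL(windy)
Best attribute is temp.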
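
As a rough check of the arithmetic on this slide, the per-cluster formula can be evaluated directly. This is a minimal Python sketch: the function name `mdl_cluster` is illustrative, and the parameter values are the ones stated for the humidity split (k = 10, m = 4, n = 2, seven instances in each cluster).

```python
from math import comb, log2


def mdl_cluster(k: int, k_i: int, n: int, size: int, m: int) -> float:
    """MDL(C_i) = log2 C(k, k_i) + log2 n + |C_i| * log2 C(k_i, m)."""
    return log2(comb(k, k_i)) + log2(n) + size * log2(comb(k_i, m))


k, m, n = 10, 4, 2                                   # values, attributes, clusters
mdl_c1 = mdl_cluster(k, k_i=8, n=n, size=7, m=m)     # humidity=high
mdl_c2 = mdl_cluster(k, k_i=9, n=n, size=7, m=m)     # humidity=normal
mdl_humidity = mdl_c1 + mdl_c2

if __name__ == "__main__":
    print(f"MDL(C1) = {mdl_c1:.2f} bits")
    print(f"MDL(C2) = {mdl_c2:.2f} bits")
    print(f"MDL(humidity) = {mdl_humidity:.2f} bits")
```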

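13 MDL Ranker
Let attribute A have values $v_1, v_2, \ldots, v_p$. They define the clustering $\{C_1, C_2, \ldots, C_p\}$, where $C_i = \{X \mid v_i \in X\}$.
Let $V_{ij}^{A} = \emptyset$ (the values of attribute j seen so far in cluster $C_i$ of A)
For each data instance $X = \{x_1, x_2, \ldots, x_m\}$
  For each attribute A (X falls in cluster $C_i$ of A)
    For each value $x_j \in X$
      $V_{ij}^{A} = V_{ij}^{A} \cup \{x_j\}$
$k_i = \sum_{j=1}^{m} |V_{ij}^{A}|$
Compute $MDL(\{C_1, C_2, \ldots, C_p\})$
Incremental (no need to store instances).
Time $O(nm^2)$, where n is the number of data instances.
Space $O(pm^2)$, where p is the maximum number of attribute values.
Evaluates 3204 instances with 395 attributes (trec data) in 3 minutes.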
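
The single-pass idea can be sketched in Python as follows. This is an illustrative reading of the slide, not the author's Java implementation: the function name `mdl_rank`, the dictionary-based bookkeeping, and the representation of an attribute value as an `(attribute, value)` pair are all assumptions. The data are the four non-class attributes of the play tennis table.

```python
from collections import defaultdict
from math import comb, log2


def mdl_rank(instances, attributes):
    """Rank attributes by the MDL of the clustering their values induce.

    instances: iterable of dicts mapping attribute name -> value.
    Returns (attribute, MDL in bits) pairs, smallest MDL first.
    """
    m = len(attributes)
    all_values = set()                                    # T
    # value_sets[a][v]: attribute values seen in the cluster where a = v (T_i)
    value_sets = {a: defaultdict(set) for a in attributes}
    sizes = {a: defaultdict(int) for a in attributes}     # |C_i|

    for x in instances:                                   # one pass, nothing stored
        pairs = {(a, x[a]) for a in attributes}
        all_values |= pairs
        for a in attributes:
            value_sets[a][x[a]] |= pairs
            sizes[a][x[a]] += 1

    k = len(all_values)
    scores = {}
    for a in attributes:
        n = len(value_sets[a])                            # number of clusters
        scores[a] = sum(
            log2(comb(k, len(t))) + log2(n) + sizes[a][v] * log2(comb(len(t), m))
            for v, t in value_sets[a].items()
        )
    return sorted(scores.items(), key=lambda item: item[1])


ATTRS = ["outlook", "temp", "humidity", "windy"]
ROWS = """sunny hot high false
sunny hot high true
overcast hot high false
rainy mild high false
rainy cool normal false
rainy cool normal true
overcast cool normal true
sunny mild high false
sunny cool normal false
rainy mild normal false
sunny mild normal true
overcast mild high true
overcast hot normal false
rainy mild high true"""
DATA = [dict(zip(ATTRS, line.split())) for line in ROWS.splitlines()]
ranking = mdl_rank(DATA, ATTRS)

if __name__ == "__main__":
    for name, bits in ranking:
        print(f"{name}: {bits:.2f} bits")
```

On this data the sketch ranks temp first, followed by humidity, outlook and windy. Each instance triggers m set unions of m pairs, which is where the $O(nm^2)$ running time comes from.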

14 Experimental Evaluation: Data
Table columns: Data Set, Instances, Attributes, Classes.
Data sets: reuters, reuters-3class, reuters-2class, trec, soybean, soybean-small, iris, ionosphere.
Java implementations of MDL ranking and clustering available from

15 Experimental Evaluation: Metrics
Average Precision
$$AveragePrecision = \frac{1}{|D_q|} \sum_{k=1}^{|D|} r_k \times PrecisionAtRank(k)$$
$$PrecisionAtRank(k) = \frac{1}{k} \sum_{i=1}^{k} r_i, \qquad r_i = \begin{cases} 1 & \text{if } a_i \in D_q \\ 0 & \text{otherwise} \end{cases}$$
Classes-to-clusters accuracy ("true" cluster membership)
root [5, 9]
  temperature=hot [2, 2]
    outlook=sunny [2] no
    outlook=overcast [2] yes
  temperature=mild [4, 2]
    windy=false [2, 1] yes
    windy=true [2, 1] yes
  temperature=cool [3, 1]
    windy=false [2] yes
    windy=true [1, 1] no
Clusters (leaves): 6
Correctly classified instances: 11 (78%)
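
Both metrics are simple to compute. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, the attribute names in the first example are made up, and classes-to-clusters accuracy is taken to mean that each leaf is labelled with its majority class. The leaf counts in the second example are the ones from the play tennis tree on this slide.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked attribute list against the set D_q."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("the reference set D_q must not be empty")
    hits = 0
    total = 0.0
    for k, attr in enumerate(ranking, start=1):
        if attr in relevant:            # r_k = 1
            hits += 1
            total += hits / k           # PrecisionAtRank(k)
    return total / len(relevant)


def classes_to_clusters_accuracy(leaf_counts):
    """Return (correct, accuracy) when each leaf takes its majority class."""
    correct = sum(max(counts) for counts in leaf_counts)
    total = sum(sum(counts) for counts in leaf_counts)
    return correct, correct / total


# Hypothetical ranking of four attributes, two of which are in D_q.
ap = average_precision(["a1", "a2", "a3", "a4"], {"a1", "a3"})

# Class counts in the six leaves of the play tennis tree.
leaves = [[2], [2], [2, 1], [2, 1], [2], [1, 1]]
correct, accuracy = classes_to_clusters_accuracy(leaves)

if __name__ == "__main__":
    print(f"average precision = {ap:.3f}")
    print(f"correct = {correct}, accuracy = {accuracy:.1%}")
```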

16 Average Precision of Attribute Ranking
Table columns: Data set, $|D_q|$, InfoGain, MDL, Error, Entropy.
Data sets: reuters, reuters-3class, reuters-2class, trec, soybean, soybean-small, iris, ionosphere.
$D_q$: set of attributes selected by Wrapper Subset Evaluator with Naïve Bayes classifier.
InfoGain: supervised attribute ranking using Information Gain Evaluator.
Error: unsupervised ranking based on evaluating the quality of clustering by the sum of squared errors.
Entropy: unsupervised ranking based on the reduction of the entropy in data when the attribute is removed (Dash and Liu 2000).

17 Classes-To-Clusters Accuracy With Reuters Data
[Figure: % accuracy of EM and k-means clustering, MDL ranked vs. InfoGain ranked attributes]

18 Classes-To-Clusters Accuracy With Reuters-3class Data
[Figure: accuracy of EM and k-means clustering, MDL ranked vs. InfoGain ranked attributes]

19 Classes-To-Clusters Accuracy With Soybean Data
[Figure: accuracy of EM and k-means clustering, MDL ranked vs. InfoGain ranked attributes]

20 MDL-Based Clustering
Function MDL-Cluster(D)
1. Choose attribute $A = \arg\min_i MDL(A_i)$
2. Let A take values $v_1, v_2, \ldots, v_p$
3. Split data $D = \bigcup_{i} C_i$, $C_i = \{X \mid v_i \in X\}$
4. If $Comp(A) > Comp(C_i)$ then stop. Return D.
5. For each $i = 1, \ldots, n$: Call MDL-Cluster($C_i$)

21 Clustering Reuters-2class Data
root [608, 39]
  trade=0 [507, 8]
    rate=0 [339, 8]
      monei=1 [48] money
      monei=0 [9, 8] money
    rate=1 [68]
      currenc=0 [00] money
      currenc=1 [68] money
  trade=1 [30, 0]
    market=0 [86, 39]
      countri=1 [67, 20] trade
      countri=0 [9, 9] trade
    market=1 [5, 62]
      bank=0 [94, ] trade
      bank=1 [2, 5] money
Clusters (leaves): 8
Correctly classified instances: 838 (90%)
MDL-Cluster Tree:
root [608, 39]
  trade=0 [507, 8] money
  trade=1 [30, 0]
    market=0 [86, 39]
      countri=1 [67, 20] trade
      countri=0 [9, 9] trade
    market=1 [5, 62]
      bank=0 [94, ] trade
      bank=1 [2, 5] money
Clusters (leaves): 5
Correctly classified instances: 838 (90%)

22 Comparing MDL, EM and k-means
Table columns: Data set; EM (Acc. %, No. of Clusters); k-means (Acc. %, No. of Clusters); MDL-Cluster (Acc. %, No. of Clusters).
Data sets: reuters, reuters-3class, reuters-2class, trec, soybean, soybean-small, iris, ionosphere.

23 Conclusion
MDL-ranker, without class information, performs close to the InfoGain method, which essentially uses class information. Thus, our approach can improve the performance of clustering algorithms in a purely unsupervised setting.
MDL-cluster outperforms EM and k-means on most benchmark data sets.
Numeric attributes? Subset evaluation? Non-hierarchical clustering?
Thank You!
