MDL-Based Unsupervised Attribute Ranking

Zdravko Markov
Computer Science Department, Central Connecticut State University
New Britain, CT 06050, USA
http://www.cs.ccsu.edu/~markov/
markovz@ccsu.edu

MDL-Based Unsupervised Attribute Ranking
- Introduction (Attribute Selection)
- MDL-based Clustering Model Evaluation
- Illustrative Example ("play tennis" data)
- Attribute Ranking Algorithm
- Hierarchical Clustering Algorithm
- Experimental Evaluation
- Conclusion

Attribute Selection
- Supervised / Unsupervised: find the smallest set of attributes that
  - maximizes predictive accuracy (supervised), or
  - best uncovers interesting natural groupings (clusters) in data according to the chosen criterion (unsupervised).
- Subset Selection / Ranking (Weighting)
  - Subset selection is computationally expensive: $2^m$ attribute sets for $m$ attributes.
  - Ranking assumes that attributes are independent.

Supervised Attribute Selection
- Wrapper methods create prediction models and use the predictive accuracy of these models to measure the attribute relevance to the classification task.
- Filter methods directly measure the ability of the attributes to determine the class labels using statistical correlation, information metrics, probabilistic or other methods.
- There exist numerous methods in this setting due to the wide availability of model evaluation criteria in supervised learning.

Unsupervised Attribute Selection
- Wrapper methods evaluate a subset of attributes by the quality of clustering obtained by using these attributes.
- Filter methods explore classical statistical methods for dimensionality reduction, like PCA and maximum variance, information-based or entropy measures.
- There exist very few methods in this setting, generally because of the difficulty of evaluating clustering models.

Clustering Model Evaluation
Chapter 4: Evaluating Clustering - MDL-Based Model and Feature Evaluation
http://www.cs.ccsu.edu/~markov/
http://www.cs.ccsu.edu/~markov/dmw4.pdf
http://www.cs.ccsu.edu/~markov/dmwdata.zip
http://www.cs.ccsu.edu/~markov/dmwsoftware.zip

Clustering Model Evaluation
- Consider each possible clustering as a hypothesis $H$ that describes (explains) data $D$ in terms of frequent patterns (regularities).
- Compute the description length of the data $L(D)$, the hypothesis $L(H)$, and the data given the hypothesis $L(D|H)$.
- $L(H)$ and $L(D)$ are the minimum number of bits needed to encode (or communicate) $H$ and $D$ respectively.
- $L(D|H)$ represents the number of bits needed to encode $D$ if we know $H$.
- If we know the pattern of $H$, there is no need to encode all its occurrences in $D$; rather, we may encode only the pattern itself and the differences that identify each individual instance in $D$.

Minimum Description Length (MDL) and Information Compression
- The more regularity in $D$, the shorter the description length $L(D|H)$.
- Need to balance $L(D|H)$ with $L(H)$, because the latter depends on the complexity of the pattern.
- Thus the best hypothesis should minimize the sum $L(H) + L(D|H)$ (MDL principle) or maximize $L(D) - L(H) - L(D|H)$ (Information Compression). Since $L(D)$ does not depend on $H$, both criteria select the same hypothesis.

Encoding MDL
- Hypotheses and data are uniformly distributed, and the probability of occurrence of an item out of $n$ alternatives is $1/n$.
- The minimum code length of the message that a particular item has occurred is $-\log_2 (1/n) = \log_2 n$ bits.
- The number of bits needed to encode the choice of $k$ items out of $n$ possible items is

$$-\log_2 \frac{1}{\binom{n}{k}} = \log_2 \binom{n}{k}$$
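The two code lengths above can be checked numerically. The following is a minimal sketch in Python; the function names `item_code_length` and `choice_code_length` are illustrative and do not come from the slides.

```python
from math import comb, log2

def item_code_length(n: int) -> float:
    """Bits needed to name one of n equally likely alternatives: -log2(1/n) = log2(n)."""
    return log2(n)

def choice_code_length(n: int, k: int) -> float:
    """Bits needed to name one particular choice of k items out of n,
    assuming all C(n, k) choices are equally likely."""
    return log2(comb(n, k))

print(item_code_length(8))          # one of 8 alternatives
print(choice_code_length(10, 8))    # one of the C(10, 8) = 45 subsets
```

Under the uniform assumption, naming one of 8 alternatives costs 3 bits, and naming one of the 45 possible 8-item subsets of 10 items costs about 5.49 bits.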

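Encoding MDL (attribute-value)
- Data $D$, instance $X \in D$; $X$ is a set of $m$ attribute values, $|X| = m$.
- $T = \bigcup_{X \in D} X$ is the set of all attribute values in $D$, $k = |T|$.
- Cluster $C_i$ is defined by the set $T_i \subseteq T$ of all attribute values that occur in its members, so $X \subseteq T_i$ for every $X \in C_i$; $k_i = |T_i|$.
- Clustering $H = \{C_1, C_2, \ldots, C_n\}$ is defined by $\{T_1, T_2, \ldots, T_n\}$.

$$L(C_i) = \log_2 \binom{k}{k_i} + \log_2 n$$

$$L(D_i \mid C_i) = |C_i| \, \log_2 \binom{k_i}{m}$$

$$MDL(C_i) = \log_2 \binom{k}{k_i} + \log_2 n + |C_i| \, \log_2 \binom{k_i}{m}$$

$$L(H) = \sum_{i=1}^{n} L(C_i), \qquad L(D \mid H) = \sum_{i=1}^{n} L(D_i \mid C_i), \qquad MDL(H) = \sum_{i=1}^{n} MDL(C_i)$$

Here $D_i$ denotes the instances of $D$ that belong to cluster $C_i$.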
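As a rough illustration, the per-cluster and per-clustering description lengths can be written as two small Python functions. This is only a sketch of the formulas on this slide; the function names and the toy input (k = 6, m = 2, two clusters) are hypothetical and chosen just to show the arithmetic.

```python
from math import comb, log2

def cluster_mdl(k: int, k_i: int, n: int, size: int, m: int) -> float:
    """MDL(C_i) = log2 C(k, k_i) + log2 n + |C_i| * log2 C(k_i, m)."""
    model_bits = log2(comb(k, k_i)) + log2(n)   # L(C_i): which k_i of the k values, which of n clusters
    data_bits = size * log2(comb(k_i, m))       # L(D_i | C_i): each member picks m of the k_i values
    return model_bits + data_bits

def clustering_mdl(k: int, n: int, m: int, clusters) -> float:
    """MDL(H): sum of MDL(C_i) over clusters given as (k_i, |C_i|) pairs."""
    return sum(cluster_mdl(k, k_i, n, size, m) for k_i, size in clusters)

# Hypothetical toy clustering: 6 distinct attribute values overall,
# 2 attributes per instance, clusters with (k_i, size) = (4, 3) and (5, 2).
print(round(clustering_mdl(k=6, n=2, m=2, clusters=[(4, 3), (5, 2)]), 2))
```

Splitting the cost into `model_bits` and `data_bits` mirrors the split between L(H) and L(D|H): a cluster that covers few attribute values is cheap to describe its members with, but the saving has to be weighed against the cost of naming the cluster itself.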

Play Tennis Data

| ID | outlook | temp | humidity | windy | play |
|----|----------|------|----------|-------|------|
| 1 | sunny | hot | high | false | no |
| 2 | sunny | hot | high | true | no |
| 3 | overcast | hot | high | false | yes |
| 4 | rainy | mild | high | false | yes |
| 5 | rainy | cool | normal | false | yes |
| 6 | rainy | cool | normal | true | no |
| 7 | overcast | cool | normal | true | yes |
| 8 | sunny | mild | high | false | no |
| 9 | sunny | cool | normal | false | yes |
| 10 | rainy | mild | normal | false | yes |
| 11 | sunny | mild | normal | true | yes |
| 12 | overcast | mild | high | true | yes |
| 13 | overcast | hot | normal | false | yes |
| 14 | rainy | mild | high | true | no |

$C_1$ = {1, 2, 3, 4, 8, 12, 14} (humidity=high)
$C_2$ = {5, 6, 7, 9, 10, 11, 13} (humidity=normal)
$T_1$ = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, humidity=high, windy=false, windy=true}
$T_2$ = {outlook=sunny, outlook=overcast, outlook=rainy, temp=hot, temp=mild, temp=cool, humidity=normal, windy=false, windy=true}

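Clustering Play Tennis Data

$$MDL(C_i) = \log_2 \binom{k}{k_i} + \log_2 n + |C_i| \, \log_2 \binom{k_i}{m}$$

$k_1 = |T_1| = 8$, $k_2 = |T_2| = 9$, $k = 10$, $m = 4$, $n = 2$

$$MDL(C_1) = \log_2 \binom{10}{8} + \log_2 2 + 7 \log_2 \binom{8}{4} \approx 49.40$$

$$MDL(C_2) = \log_2 \binom{10}{9} + \log_2 2 + 7 \log_2 \binom{9}{4} \approx 53.16$$

$MDL(\{C_1, C_2\}) = MDL(\text{humidity}) \approx 102.56$ bits.

Ranking of all four attributes:
1. MDL(temp) = 101.87
2. MDL(humidity) = 102.56
3. MDL(outlook) = 103.46
4. MDL(windy) = 106.33

Best attribute is temp.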
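The computation above can be reproduced directly from the play tennis table. The sketch below is one possible Python rendering, not the author's Java implementation; it represents each instance as a set of (attribute, value) pairs, leaves out the class label `play`, and derives T, T_i, k and k_i from the data. The names `ATTRS`, `ROWS`, `DATA` and `attribute_mdl` are illustrative.

```python
from math import comb, log2

ATTRS = ["outlook", "temp", "humidity", "windy"]
ROWS = """\
sunny hot high false
sunny hot high true
overcast hot high false
rainy mild high false
rainy cool normal false
rainy cool normal true
overcast cool normal true
sunny mild high false
sunny cool normal false
rainy mild normal false
sunny mild normal true
overcast mild high true
overcast hot normal false
rainy mild high true"""

# Each instance X is a set of (attribute, value) pairs; the class label is not used.
DATA = [frozenset(zip(ATTRS, line.split())) for line in ROWS.splitlines()]

def attribute_mdl(data, attr):
    """MDL of the clustering obtained by splitting `data` on the values of `attr`."""
    m = len(data[0])                      # attribute values per instance
    k = len(set().union(*data))           # |T|: distinct attribute values in the data
    clusters = {}
    for x in data:
        clusters.setdefault(dict(x)[attr], []).append(x)
    n = len(clusters)
    total = 0.0
    for members in clusters.values():
        k_i = len(set().union(*members))  # |T_i|: values occurring in this cluster
        total += log2(comb(k, k_i)) + log2(n) + len(members) * log2(comb(k_i, m))
    return total

# Lower MDL means a better attribute.
ranking = sorted(ATTRS, key=lambda a: attribute_mdl(DATA, a))
for a in ranking:
    print(f"{a:10s} {attribute_mdl(DATA, a):.2f}")
```

Run on the 14 instances, this should order the attributes as temp, humidity, outlook, windy, in agreement with the ranking on the slide.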

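MDL Ranker
- Let attribute $A$ have values $v_1, v_2, \ldots, v_p$. It induces the clustering $\{C_1, C_2, \ldots, C_p\}$, where $C_i = \{X \mid v_i \in X\}$.
- Let $V_{ij}^A = \emptyset$.
- For each data instance $X = \{x_1, x_2, \ldots, x_m\}$:
  - For each attribute $A$ (with value $v_i$ in $X$):
    - For each value $x_j \in X$: $V_{ij}^A = V_{ij}^A \cup \{x_j\}$
- $k_i = \sum_{j=1}^{m} |V_{ij}^A|$
- Compute $MDL(\{C_1, C_2, \ldots, C_p\})$.
- Incremental (no need to store instances).
- Time $O(nm^2)$, where $n$ is the number of data instances.
- Space $O(pm^2)$, where $p$ is the maximum number of attribute values.
- Evaluates 3204 instances with 13195 attributes (trec data) in 3 minutes.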
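The incremental idea can be sketched as a small Python class that reads instances one at a time and keeps only value sets and counts, never the instances themselves. This is an illustrative reading of the slide, not the author's Java code: the class name `IncrementalMDLRanker`, its methods, and the four-instance toy stream are all assumptions made for the example.

```python
from collections import defaultdict
from math import comb, log2

class IncrementalMDLRanker:
    def __init__(self, attrs):
        self.attrs = list(attrs)
        # seen[A][v]: set of (attribute, value) pairs co-occurring with A = v (this is T_i)
        self.seen = {a: defaultdict(set) for a in self.attrs}
        # size[A][v]: number of instances with A = v (this is |C_i|)
        self.size = {a: defaultdict(int) for a in self.attrs}
        self.all_values = set()   # T: every (attribute, value) pair seen so far

    def update(self, instance):
        """Consume one instance (dict attribute -> value); the instance is not stored."""
        pairs = set(instance.items())
        self.all_values |= pairs
        for a in self.attrs:          # m attributes, each merging m pairs: O(m^2) per instance
            v = instance[a]
            self.seen[a][v] |= pairs
            self.size[a][v] += 1

    def mdl(self, a):
        """MDL of the clustering induced by attribute `a`."""
        k, m, n = len(self.all_values), len(self.attrs), len(self.seen[a])
        return sum(
            log2(comb(k, len(t))) + log2(n) + self.size[a][v] * log2(comb(len(t), m))
            for v, t in self.seen[a].items()
        )

    def ranking(self):
        """Attributes ordered from lowest (best) to highest MDL."""
        return sorted(self.attrs, key=self.mdl)

# Hypothetical toy stream: attributes a and b always agree, c varies independently.
ranker = IncrementalMDLRanker(["a", "b", "c"])
for row in [("x", "p", "u"), ("x", "p", "v"), ("y", "q", "u"), ("y", "q", "v")]:
    ranker.update(dict(zip(["a", "b", "c"], row)))
print(ranker.ranking())
```

In the toy stream, splitting on `a` or `b` gives clusters that each cover 4 of the 6 attribute values, while splitting on `c` gives clusters that cover 5, so `c` gets the highest MDL and is ranked last.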

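Experimental Evaluation Data

| Data Set | Instances | Attributes | Classes |
|----------------|-----------|------------|---------|
| reuters | 1504 | 2887 | 13 |
| reuters-3class | 1146 | 2887 | 3 |
| reuters-2class | 927 | 2887 | 2 |
| trec | 3204 | 13195 | 6 |
| soybean | 683 | 36 | 19 |
| soybean-small | 47 | 36 | 4 |
| iris | 150 | 5 | 3 |
| ionosphere | 351 | 35 | 2 |

Java implementations of MDL ranking and clustering are available from http://www.cs.ccsu.edu/~markov/dmwsoftware.zip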

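Experimental Evaluation Metrics

Average precision of an attribute ranking $a_1, a_2, \ldots$ with respect to the reference set $D_q$:

$$\text{AveragePrecision} = \frac{1}{|D_q|} \sum_{k=1}^{|D|} r_k \cdot \text{PrecisionAtRank}(k), \qquad \text{PrecisionAtRank}(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$$

$$r_i = \begin{cases} 1 & \text{if } a_i \in D_q \\ 0 & \text{otherwise} \end{cases}$$

Classes-to-clusters accuracy (true cluster membership). Example:

root [5, 9]
  temperature=hot [2, 2]
    outlook=sunny [2] no
    outlook=overcast [2] yes
  temperature=mild [4, 2]
    windy=false [2, 1] yes
    windy=true [2, 1] yes
  temperature=cool [3, 1]
    windy=false [2] yes
    windy=true [1, 1] no
-----------------------------------------
Clusters (leaves): 6
Correctly classified instances: 11 (78%)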
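Both metrics are short enough to write out. The sketch below assumes that a ranking is a list of attribute names ordered best first, and that each cluster is scored by its majority class, which is how the leaves in the example tree are labeled; the function names are illustrative.

```python
from collections import Counter

def average_precision(ranking, relevant):
    """ranking: attributes ordered best first; relevant: the reference set D_q."""
    hits, total = 0, 0.0
    for k, attr in enumerate(ranking, start=1):
        if attr in relevant:       # r_k = 1
            hits += 1
            total += hits / k      # PrecisionAtRank(k) = (number of hits in top k) / k
    return total / len(relevant)

def classes_to_clusters_accuracy(clusters):
    """clusters: one list of true class labels per cluster.
    Each cluster counts its majority class as correctly classified."""
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    return correct / sum(len(labels) for labels in clusters)

# A single relevant attribute found at rank 3 gives 1/3.
print(average_precision(["a", "b", "c", "d"], {"c"}))

# Class labels in the six leaves of the example tree.
leaves = [["no", "no"], ["yes", "yes"], ["yes", "yes", "no"],
          ["yes", "yes", "no"], ["yes", "yes"], ["no", "yes"]]
print(classes_to_clusters_accuracy(leaves))
```

The six leaves contribute 2 + 2 + 2 + 2 + 2 + 1 = 11 correctly classified instances out of 14, which is the 78% reported for the example tree.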

Average Precision of Attribute Ranking

| Data set | $\lvert D_q \rvert$ | InfoGain | MDL | Error | Entropy |
|----------------|----|--------|--------|--------|--------|
| reuters | 5 | 0.383 | 0.435 | 0.0642 | 0.0030 |
| reuters-3class | 0 | 0.3948 | 0.852 | 0.257 | 0.0027 |
| reuters-2class | 7 | 0.506 | 0.2438 | 0.788 | 0.3073 |
| trec | 4 | 0.4890 | 0.244 | 0.0637 | 0.000 |
| soybean | 6 | 0.6265 | 0.5606 | 0.387 | 0.452 |
| soybean-small | 2 | 0.6428 | 0.3500 | 0.093 | 0.23 |
| iris | 1 | 1.0000 | 1.0000 | 1.0000 | 0.3333 |
| ionosphere | 9 | 0.6596 | 0.504 | 0.2575 | 0.4252 |

- $D_q$: set of attributes selected by Wrapper Subset Evaluator with Naïve Bayes classifier.
- InfoGain: supervised attribute ranking using Information Gain Evaluator.
- Error: unsupervised ranking based on evaluating the quality of clustering by the sum of squared errors.
- Entropy: unsupervised ranking based on the reduction of the entropy in data when the attribute is removed (Dash and Liu 2000).

Classes-To-Clusters Accuracy With Reuters Data
[Figure: two panels, EM and k-means. Y-axis: % accuracy. X-axis: number of attributes, with ticks up to 2887. Curves: MDL ranked, InfoGain ranked.]

Classes-To-Clusters Accuracy With Reuters-3class Data
[Figure: two panels, EM and k-means. Y-axis: % accuracy. X-axis: number of attributes, with ticks up to 2886. Curves: MDL ranked, InfoGain ranked.]

Classes-To-Clusters Accuracy With Soybean Data
[Figure: two panels, EM and k-means. Y-axis: % accuracy. X-axis: number of attributes, with ticks from 36 downward. Curves: MDL ranked, InfoGain ranked.]

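MDL-Based Clustering
Function MDL-Cluster(D)
1. Choose attribute $A = \arg\min_i MDL(A_i)$.
2. Let $A$ take values $v_1, v_2, \ldots, v_p$.
3. Split the data into $n = p$ clusters: $D = \bigcup_{i=1}^{n} C_i$, where $C_i = \{X \mid v_i \in X\}$.
4. If $Comp(A) > Comp(C_i)$ then stop. Return $D$.
5. For each $i = 1, \ldots, n$: call MDL-Cluster($C_i$).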
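The recursive structure of MDL-Cluster can be sketched in Python on the play tennis data. One deliberate substitution: the slides do not define how Comp is computed, so this sketch replaces the compression test in step 4 with a simple `max_depth` limit and also stops when no attribute has more than one value left. Everything else (choose the attribute with minimum MDL, split on its values, recurse) follows the steps above. All names are illustrative, and `k` is recomputed on each subset, which is an assumption.

```python
from math import comb, log2

ATTRS = ["outlook", "temp", "humidity", "windy"]
ROWS = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
DATA = [dict(zip(["id"] + ATTRS + ["play"], [i] + line.split()))
        for i, line in enumerate(ROWS.splitlines(), start=1)]

def values_of(instances):
    """Set of (attribute, value) pairs occurring in the instances (T or T_i)."""
    return {(a, x[a]) for x in instances for a in ATTRS}

def split(instances, attr):
    clusters = {}
    for x in instances:
        clusters.setdefault(x[attr], []).append(x)
    return clusters

def attribute_mdl(instances, attr):
    k, m = len(values_of(instances)), len(ATTRS)
    clusters = split(instances, attr)
    n = len(clusters)
    return sum(
        log2(comb(k, len(values_of(c)))) + log2(n)
        + len(c) * log2(comb(len(values_of(c)), m))
        for c in clusters.values()
    )

def mdl_cluster(instances, max_depth):
    """Return the leaves of a depth-limited MDL split tree (depth limit replaces Comp test)."""
    # Only attributes that still take more than one value can split the data.
    candidates = [a for a in ATTRS if len({x[a] for x in instances}) > 1]
    if max_depth == 0 or not candidates:
        return [instances]
    best = min(candidates, key=lambda a: attribute_mdl(instances, a))
    leaves = []
    for members in split(instances, best).values():
        leaves.extend(mdl_cluster(members, max_depth - 1))
    return leaves

leaves = mdl_cluster(DATA, max_depth=2)
for leaf in leaves:
    print([x["id"] for x in leaf], [x["play"] for x in leaf])
```

With `max_depth=2` the sketch splits first on temp, then on outlook (hot) and windy (mild, cool), giving the same six leaves as the play tennis example tree shown with the evaluation metrics.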

Clustering Reuters-2class Data

root (56550.58) [608, 319]
  trade=0 (434956.39) [507, 18]
    rate=0 (26626.68) [339, 18]
      mone=1 (2254.60) [148] money
      mone=0 (6236.34) [191, 18] money
    rate=1 (25589.70) [168]
      currenc=0 (70870.68) [100] money
      currenc=1 (5049.67) [68] money
  trade=1 (204850.37) [301, 101]
    market=0 (57978.80) [186, 39]
      countr=1 (6448.90) [67, 20] trade
      countr=0 (06457.20) [119, 19] trade
    market=1 (06422.43) [115, 62]
      bank=0 (73572.74) [94, 11] trade
      bank=1 (48489.70) [21, 51] money
-----------------------------------------
Clusters (leaves): 8
Correctly classified instances: 838 (90%)

MDL-Cluster Tree:

root (56550.58) [608, 319]
  trade=0 (434956.39) [507, 18] money
  trade=1 (204850.37) [301, 101]
    market=0 (57978.80) [186, 39]
      countr=1 (6448.90) [67, 20] trade
      countr=0 (06457.20) [119, 19] trade
    market=1 (06422.43) [115, 62]
      bank=0 (73572.74) [94, 11] trade
      bank=1 (48489.70) [21, 51] money
-----------------------------------------
Clusters (leaves): 5
Correctly classified instances: 838 (90%)

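Comparing MDL, EM and k-means

| Data set | EM Acc. % | EM No. of Clusters | k-means Acc. % | k-means No. of Clusters | MDL-Cluster Acc. % | MDL-Cluster No. of Clusters |
|----------------|-----|---|----|---|----|---|
| reuters | 43 | 6 | 3 | 3 | 59 | 2 |
| reuters-3class | 58 | 3 | 48 | 3 | 73 | 7 |
| reuters-2class | 7 | 2 | 6 | 2 | 90 | 7 |
| trec | 26 | 6 | 29 | 6 | 44 | |
| soybean | 60 | 9 | 5 | 9 | 5 | 7 |
| soybean-small | 100 | 4 | 9 | 4 | 83 | 4 |
| iris | 95 | 3 | 69 | 3 | 96 | 3 |
| ionosphere | 89 | 2 | 8 | 2 | 80 | 3 |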

Conclusion
- MDL-ranker, without class information, performs closely to the InfoGain method, which essentially uses class information. Thus, our approach can improve the performance of clustering algorithms in a purely unsupervised setting.
- MDL-cluster outperforms EM and k-means on most benchmark data sets.
- Numeric attributes?
- Subset evaluation?
- Non-hierarchical clustering?

Thank You!