Evaluating Data Minability Through Compression An Experimental Study

Size: px

Start display at page:

Download "Evaluating Data Minability Through Compression An Experimental Study"

Solomon Boyd
5 years ago
Views:

1 Evaluatig Data Miability Through Compressio A Experimetal Study Da Simovici Uiv. of Massachusetts Bosto, Bosto, USA, dsim at cs.umb.edu Da Pletea Uiv. of Massachusetts Bosto, Bosto, USA, dpletea at cs.umb.edu Saaid Baraty Uiv. of Massachusetts Bosto, Bosto, USA, sbaraty at cs.umb.edu Abstract The effectiveess of compressio algorithms is icreasig as the data subjected to compressio cotais repetitive patters. This basic idea is used to detect the existece of regularities i various types of data ragig from market basket data to udirected graphs. The results are quite idepedet of the particular algorithms used for compressio ad offer a idicatio of the potetial of discoverig patters i data before the actual miig process takes place. Keywords-data miig; lossless compressio; LZW; market basket data; patters; Kroecker product. I. INTRODUCTION Our goal is to show that compressio ca be used as a tool to evaluate the potetial of a data set of producig iterestig results i a data miig process. The basic idea that data that displays repetitive patters or patters that occur with a certai regularity will be compressed more efficietly compared to data that has o such characteristics. Thus, a pre-processig phase of the miig process should allow to decide whether a data set is worth miig, or compare the iterestigess of applyig miig algorithms to several data sets. Sice compressio is geerally iexpesive ad compressio methods are well-studied ad uderstood, pre-miig usig compressio will help data miig aalysts to focus their efforts o miig resources that ca provide a highest payout without a exorbitat cost. Compressio has received lots of attetio i the data miig literature. As observed by Maila [7], data compressio ca be regarded as oe of the fudametal approaches to data miig [7], sice the goal of the data miig is to compress data by fidig some structure i it. The role of compressio developig parameter-free data miig algorithms i aomaly detectio, classificatio ad clusterig was examied i [4]. The size C(x) of a compressed file x is as a approximatio of Kolmogorov complexity [2] ad allows the defiitio of a pseudo-distace betwee two files x ad y as d(x, y) = C(xy) C(x) + C(y). Further advaces i this directio were developed i [8][5][6]. A Kolmogorov complexity-based dissimilarity was successfully used to texture matchig problems i [1] which have a broad spectrum of applicatios i areas like bioiformatics, atural laguages. ad music. We illustrate the use of lossless compressio i pre-miig data by focusig o several distict data miig processes: files with frequet patters, frequet itemsets i market basket data, ad explorig similarity of graphs. The LZW (Lempel-Ziv-Welch) algorithm was itroduced i 1984 by T. Welch i [9] ad is amog the most popular compressio techiques. The algorithm does ot eed to check all the data before startig the compressio ad the performace is based o the umber of the repetitios ad the legths of the strigs ad the ratio of 0s/1s or true/false at the bit level. There are several versios of the LZW algorithm. Popular programs (such as Wizip) use variatios of the LZW compressio. The Wizip/Zip type of algorithms also work at the bit level ad ot at a charater/byte level. We explore three experimetal settigs that provide strog empirical evidece of the correlatio betwee compressio ratio ad the existece of hidde patters i data. I Sectio II, we compress biary strigs that cotai patters; i Sectio III, we study the compressibility of adjacecy matrix for graphs relative to the etropy of distributio of subgraphs. Fially, i Sectio IV, we examie the compressibility of files that cotai market basket data sets. II. PATTERNS IN STRINGS AND COMPRESSION Let A be the set of strigs o the alphabet A. The legth of a strig w is deoted by w. The ull strig o A is deoted by λ ad we defie A + as A + = A {λ}. If w A ca be writte as w = utv, where u, v A ad t A +, we say that the pair (t, m) is a occurrece of t i w, where m is the legth of u. The occurreces (x, m) ad (y, p) are overlappig if p < m + x. If this is the case, there is a proper suffix of x that equals a proper prefix of y. If t is a word such that the sets of its proper prefixes ad its proper suffixes are disjoit, there are o overlappig occurreces of x i ay word. The umber of occurreces of a strig t i a strig w is deoted by t (w). Clearly, we have { a (w) a A} = w. The prevalece of t i w is the umber f t (w) = t(w) t w which gives the ratio of the characters cotaied i the occurreces of t relative to the total umber of characters i the strig.

2 The result of applyig a compressio algorithm C to a strig w A is deoted by C(w) ad the compressio ratio is the umber CR C (w) = C(w). w I this sectio, we shall use the biary alphabet B = {0, 1} ad the LZW algorithm or the compressio algorithm of the package java.util.zip. We geerated radom strigs of bits (0s ad 1s) ad computed the compressio ratio strigs with a variety of symbol distributios. A strig w that cotais oly 0s (or oly 1s) achieves a very good compressio ratio of CR jzip (w) = for 100K bits ad CR jzip = for 500K bits, where jzip deotes the compressio algorithm from the package java.util.zip. Figure 1 shows, as expected, that the worst compressio ratio is achieved whe 0s ad 1s occur with equal frequecies. Table I PATTERN 001 PREVALENCE VERSUS THE CR jzip Prevalece of CR jzip Baselie 001 patter 0% % % % % % % % % % % % Figure 2. Variatio of compressio rate depeds o the prevalece of the patter 001 Figure 1. Baselie CR jzip Behavior For strigs of small legth (less tha 10 4 bits) the compressio ratio may exceed 1 because of the overhead itroduced by the algorithm. However, whe the size of the radom strig exceeds 10 6 bits this pheomeo disappears ad the compressio ratio depeds oly o the prevalece of the bits ad is relatively idepedet o the size of the file. Thus, i Figure 1, the curves that correspod to files of size 10 6 ad overlap. We refer to the compressio ratio of a radom strig w with a ( 0 (w), 1 (w)) distributio as the baselie compressio ratio. We created a series of biary strigs ϕ t,m which have a miimum guarateed umber m of occurreces of patters t {0, 1} k, where 0 m 100. Specifically, we created 101 files ϕ 001,m for the patter 001, each cotaiig 100K bits ad we geerated similar series for t {01, 0010, 00010}. The compressio ratio is show i Figure 2. The compressio ratio starts at a value of 0.94 ad after the prevalece of the patter becomes more frequet tha 20% the compressio ratio drops dramatically. Results of the experimet are show i Table I ad i Figure 3. Figure 3. Depedecy of Compressio Ratio o Patter Prevalece We coclude that the presece of repeated patters i strigs leads to a high degree of compressio (that is, to low compressio ratios). Thus, a low compressio ratio for a file idicates that the miig process may produce iterestig results. III. RANDOM INSERTION AND COMPRESSION For a matrix M {0, 1} u v deote by i (M) the umber of etries of M that equal i, where i {0, 1}. Clearly, we have 0 (B)+ 1 (B) = uv. For a radom variable V which

3 rages over the set of matrices {0, 1} u v let ν i (V ) be the radom variable whose values equal the umber of etries of V that equal i, where i {0, 1}. Let A {0, 1} p q be a 0/1 matrix ad let ( ) B1 B B : 2 B k, p 1 p 2 p k be a matrix-valued radom variable where B j R r s, p j 0 for 1 j k, ad k j=1 p j = 1. Defiitio 3.1: The radom variable A B obtaied by the isertio of B ito A is give by a 11 B... a 1 B A B =..... R mr s a m1 B... a m B I other words, the etries of A B are obtaied by substitutig the block a ij B l with the probability p l for a ij i A. Note that this operatio is a probabilistic geeralizatio of Kroecker s product for if ( ) B1 B :, 1 the A B has as its uique value the Kroecker product A B. The expected umber of 1s i the isertio A B is k E[ν 1 (A B)] = 1 (A) 1 (B j )p j j=1 Whe 1 (B 1 ) = = 1 (B k ) =, we have E[ν 1 (A B)] = 1 (A). I the experimet that ivolves isertio, we used a matrix-valued radom variable such that 1 (B 1 ) = = 1 (B k ) =. Thus, the variability of the values of A B is caused by the variability of the cotets of the matrices B 1,..., B k which ca be evaluated usig the etropy of the distributio of B, k H(B) = p j log 2 p j. j=1 We expect to obtai a strog positive correlatio betwee the etropy of B ad the degree of compressio achieved o the file that represets the matrix A B, ad the experimets support this expectatio. I a first series of compressios, we worked with a matrix A {0, 1} ad with a matrix-valued radom variable ( ) B1 B B : 2 B 3, p 1 p 2 p 3 where B j {0, 1} 3 3, ad 1 (B 1 ) = 1 (B 2 ) = 1 (B 3 ) = 4. Several probability distributios were cosidered, as show i Table II. Values of A B had = etries. I Table II, we had 39% 1s ad the baselie compressio rate for a biary file with this ratio of 1s is We also computed the correlatio betwee the CR jzip ad the Shao etropy of the probability distributio ad obtaied the value for 3 matrices. I Table III, we did the same experimet but with 4 differet matrices of 4 4. A strog correlatio (0.992.) was observed betwee CR jzip ad the Shao etropy of the probability distributio. Table II MATRIX INSERTIONS, ENTROPY AND COMPRESSION RATIOS Probability distributio CR jzip Shao Etropy (0, 1, 0) (1, 0, 0) (0, 0, 1) (0.2, 0.2, 0.6) (0.6, 0.2, 0.2) (0.33, 0.33, 0.34) (0, 0.3, 0.7) (0.9, 0.1, 0) (0.8, 0, 0.2) (0.49, 0.25, 0.26) (0.15, 0.35, 0.5) Table III KRONECKER PRODUCT AND PROBABILITY DISTRIBUTION FOR 4 MATRICES Probability distributio CR jzip Shao Etropy (0, 1,0, 0) (0, 1,0, 0) (0.2, 0.2, 0.2, 0.4) (0.25, 0.25, 0.25, 0.25) (0.4, 0, 0.2, 0.4) (0.3, 0.1, 0.2, 0.4) (0.45, 0.12, 0.22, 0.21) Figure 4. Evolutio CR jzip ad Shao Etropy of Probability Distributio. I Figure 4, we have the evolutio of CR jzip o the y axis ad o the x axis the Shao Etropy of the probability distributio for both experimets. We ca see clearly the liear correlatio betwee the two.

4 This experimet proves us agai that i case of repetitios/patters the CR jzip is better tha i the case of radomly geerated files. Next, we examie the compressibility of biary square matrices ad its relatioship with the distributio of pricipal submatrices. A biary square matrix is compressed by first vectorizig the matrix ad the compressig the biary sequece. The issue is relevat i graph theory, where the pricipal submatrices of the adjacecy matrix of a graph correspod to the adjacecy matrices of the subgraphs of that graph. The patters i a graph are captured i the form of frequet isomorphic subgraphs. There is a strog correlatio betwee the compressio ratio of the adjacecy matrix of a graph ad the frequecies of the occurreces of isomorphic subgraphs of it. Specifically, the lower the compressio ratio is, the higher are the frequecies of isomorphic subgraphs ad hece the worthier is the graph for beig mied. Let G be a udirected graph havig {v 1,..., v } as its set of odes. The adjacecy matrix of G, A G {0, 1} is defied as { 1 if there is a edge betwee v i ad v j i G (A G ) ij = 0 otherwise. We deote with CR C (A G ) the compressio ratio of the adjacecy matrix of graph G obtaied by applyig the compressio algorithm C. Defie the pricipal subcompoet of matrix A G with respect to the set of idices S = {s 1,..., s k } {1, 2,..., } to be the k k matrix A G (S) such that 1 if there is a edge betwee v si ad v sj A G (S) ij = i G 0 otherwise. The matrix A G (S) is the adjacecy matrix of the subgraph of G which cosists of the odes with idices i S alog with those edges that coect these odes. We deote by P (k) the collectio of all subsets of {1, 2,..., } of size k where 2 k. We have P (k) = ( k). Let (M k 1,..., M k l k ) be a eumeratio of possible adjacecy matrices of graphs with k odes where l k = 2 k(k 1) 2. We defie the fiite probability distributio ( ) k 1 P(G, k) = (G ) P (k),..., k l k (G ), P (k) where k i (G ) for 1 i l k is the umber of subgraphs of G with adjacecy matrix M k i. The Shao etropy of this probability distributio is: H P (G, k) = l k i=1 k i (G ) P (k) log k i (G ) 2 P (k). If H P (G, k) is low, there are to be fewer ad larger sets of isomorphic subgraphs of G of size k. I other words, small values of H P (G, k) for various values of k suggest that the graph G cotais repeated patters ad is susceptible to produce iterestig results. Note that although two isomorphic subgraphs do ot ecessarily have the same adjacecy matrix, the umber H P (G, k) is a good idicator of the frequecy of isomorphic subgraphs ad hece subgraph patters. We evaluated the correlatio betwee CR jzip (A G ) ad H P (G, k) for differet values of k. As expected, the compressio ratio of the adjacecy matrix ad the distributio etropy of graphs are roughly the same for isomorphic graphs, so both umbers are characteristic for a isomorphism type. If is a permutatio of the vertices of G, the adjacecy matrix of the graph G obtaied by applyig the permutatio is defied by A G is give by A G = P A G P 1. We compute this adjacecy matrix of A G, the etropy H P (G, k) the compressio ratio CR jzip (A G ) for several values of k ad permutatios. We radomly geerated graphs with = 60 odes ad various umber of edges ragig from 5 to For each geerated graph, we radomly produced twety permutatios of its set of odes ad computed H P (G, k) ad CR jzip (A G ). Fially, for each graph we calculated the ratio of stadard deviatio over average for the computed compressio ratios, followed by the same computatio for distributio etropies. The results of this experimet are show i Figures 5 ad 6 agaist the umber of edges. As it ca be see, the deviatio over mea of the compressio ratios for = 60 does ot exceed the umber Also, the deviatio over average of the distributio etropies for various values of k do ot exceed I particular, the deviatio of the distributio etropy for the graphs of 100 to 1500 edges falls below 0.001, which allows us to coclude that the deviatios of both compressio ratio ad distributio etropy with respect to isomorphisms are egligible. For each k {3, 4, 5}, we geerated radomly 560 graphs havig 60 vertices ad sets of edges whose size were varyig from 10 to The, the umbers H P (G, k) ad CR jzip (A G ) were computed. Figure 7 captures the results of the experimet. Each plot cotais two curves. The first curve represets the chages i average CR jzip (A G ) for forty radomly geerated graphs of equal umber of edges. The secod curve represets the variatio of the average H P (G, k) for the same forty graphs. The treds of these two curves are very similar for differet values of k. Table IV cotais the correlatio betwee CR jzip (A G ) ad H P (G, k) calculated for the 560 radomly geerated graphs for each value of k.

5 Figure 5. Stadard deviatio vs. average of the CR jzip (A G ) for a umber of differet permutatios of odes for the same graph. The horizotal axis is labelled with the umber of edges of the graph. = 60 ad k = 3 Figure 6. Stadard deviatio vs. average of the H P (G, k) of a umber of differet permutatios of odes for the same graph. The horizotal axis is labelled with the umber of edges of the graph. Each curve correspods to oe value of k. IV. FREQUENT ITEMS SETS AND COMPRESSION RATIO = 60 ad k = 4 A market basket data set cosists of a multiset T of trasactios. Each trasactio t is a subset of a set of items I = {i 1,...,i N }. A trasactio is described by its characteristic N-tuple t = (t 1,..., t N ), where { 1 if i k t. t k = 0 otherwise, for 1 k N. The legth of a trasactio t is t = N k=0 t k, while the average size of trasactios is { t t i T } T. The support of a set of items K of the data set T is {t T K t the umber supp(k) = T. The set of items K is s-frequet if supp(k) > s. The study of market basket data sets is cocered with the idetificatio of associatio rules. A pair of item sets (X, Y ) is a associatio rule. Its support, supp(x Y ) equals supp(x) ad its cofidece cof(x Y ) is defied as cof(x Y ) = supp(xy ) supp(x). = 60 ad k = 5 Figure 7. Plots of average CR jzip (A G ) (CMP RTIO) ad average H P (G, k) (DIST ENT) for radomly geerated graphs G of equal umber of edges with respect to the umber of edges.

6 Table IV CORRELATIONS BETWEEN CR jzip (A G ) AND H P (G, k) k Correlatio Usig the artificial trasactio ARMier geerator described i [3], we created a basket data set. Trasactios are represeted by sequeces of bits (t 1,, t N ). The multiset of M trasactios was represeted as a biary strig of legth M N obtaied by cocateatig the strigs that represet trasactios. We geerated files with 1000 trasactios, with 100 items available i the basket, addig up to 100K bits. For data sets havig the same umber of items ad trasactios, the efficiecy of the compressio icreases whe the umber of patters is lower (causig more repetitios). I a experimet with a average size of a frequet item set equal to 10, the average size of a trasactio equal to 15, ad the umber of frequet item sets varyig i the set {5, 10, 20, 30, 50, 75, 100, 200, 500, 1000}, the compressio ratio had a sigificat variatio ragig betwee 0.20 ad 0.75, as show i Table V. The correlatio betwee the umber of patters ad CR was Although the frequecy of 1s ad baselie compressio ratio were roughly costat (at 0.75), the umber of patters ad compressio ratio were correlated. Table V NUMBER OF ASSOCIATION RULES AT 0.05 SUPPORT LEVEL AND 0.9 CONFIDENCE Number of Patters Frequecy of 1s Baselie Compressio Number of assoc. compressio ratio rules 5 16% ,128, % ,539, % ,233, % , % ,910, % , % , % % % ivestigatios show that idetifyig compressible areas of huma DNA is a useful tool for detectig areas where the gee replicatio mechaisms are disturbed (a pheomeo that occurs i certai geetically based diseases. REFERENCES [1] B. J. L. Campaa ad E. J. Keogh. A compressio based distace measure for texture. I SDM, pages , [2] R. Cilibrasi ad P. M. B. Vitayi. Clusterig by compressio. IEEE Trasactios o Iformatio Theory, 51: , [3] L. Cristofor. ARMier project, [4] E. Keogh, S. Loardi, ad C. A. Rataamahataa. Towards parameter-free data miig. I Proc. 10th ACM SIGKDD Itl Cof. Kowledge Discovery ad Data Miig, pages ACM Press, [5] E. Keogh, S. Loardi, C. A. Rataamahataa, L. Wei, S. Lee, ad J. Hadley. Compressio-based data miig of sequetial data. Data Miig ad Kowledge Discovery, 14:99 129, [6] E. J. Keogh, L. Keogh, ad J. Hadley. Compressiobased data miig. I Ecyclopedia of Data Warehousig ad Miig, pages [7] H. Maila. Theoretical frameworks for data miig. SIGKDD Exploratio, 1:30 32, [8] L. Wei, J. Hadley, N. Marti, T. Su, ad E. J. Keogh. Clusterig workflow requiremets usig compressio dissimilarity measure. I ICDM Workshops, pages 50 54, [9] T. Welch. A techique for high performace data compressio. IEEE Computer, 17:8 19, Further, there was a strog egative correlatio (-0.92) betwee the compressio ratio ad the umber of associatio rules idicatig that market basket data sets that satisfy may associatio rules are very compressible V. CONCLUDING REMARKS Compressio ratio of a file ca be computed fast ad easy, ad i may cases offers a cheap way of predictig the existece of embedded patters i data. Thus, it becomes possible to obtai a approximative estimatio of the usefuless of a i-depth exploratio of a data set usig more sophisticated ad expesive algorithms. The use of compressio as a measure of miability is illustrated o a variety of paradigms: graph data, market basket data, etc. Recet

Evaluating Data Minability Through Compression An Experimental Study

Evaluating Data Minability Through Compression An Experimental Study Evaluatig Data Miability Through Compressio A Experimetal Study Da Simovici Uiv. of Massachusetts Bosto, Bosto, USA, dsim at cs.umb.edu Da Pletea Uiv. of Massachusetts Bosto, Bosto, USA, dpletea at cs.umb.edu