Improved Similarity Measures For Software Clustering


European Conference on Software Maintenance and Reengineering

Rashid Naseem, Onaiza Maqbool, Siraj Muhammad
Dept. of Computer Science, Quaid-I-Azam University, Islamabad
Elixir Technologies Pakistan (PVT) LTD

Abstract
Software clustering is a useful technique for recovering the architecture of a software system. The results of clustering depend upon the choice of entities, features, similarity measures and clustering algorithms. Different similarity measures have been used for determining similarity between entities during the clustering process. In the software architecture recovery domain, the Jaccard and the Unbiased Ellenberg measures have shown better results than other measures for binary and non-binary features respectively. In this paper we analyze the Russell and Rao measure for binary features to show the conditions under which its performance is expected to be better than that of Jaccard. We also show how our proposed Jaccard-NM measure is suitable for software clustering and propose its counterpart for non-binary features. Experimental results indicate that our proposed Jaccard-NM measure and the Russell & Rao measure perform better than the Jaccard measure for binary features, while for non-binary features, the proposed Unbiased Ellenberg-NM measure produces results which are closer to the decomposition prepared by experts.

Index Terms
Software Clustering, Jaccard-NM Measure, Jaccard Measure, Unbiased Ellenberg-NM Measure, Russell & Rao Measure

I. INTRODUCTION
Software clustering has engaged the interest of researchers in the last two decades, primarily as a technique to facilitate understanding of legacy software systems. When the architectural documentation is not available, or the documentation has not been updated to reflect changes in the software over time, software clustering may be used for software modularization and architecture recovery [1], [2].
Besides clustering, other techniques used for this purpose are association rule mining [3], concept analysis [4] and graphical visualization [5]. The clustering process is used to modularize a software system or to recover sub-systems by grouping together software entities that are similar to each other. Thus entities within a cluster have similar characteristics or features, and are dissimilar from entities in other clusters. To determine similarity based on the features of an entity, a similarity measure is employed. Many different similarity measures are available. The choice of a measure depends on the characteristics of the domain in which it is applied. In the software domain, the most commonly used similarity measure for hierarchical clustering is the Jaccard coefficient for binary features [6], [7], while for non-binary features the Unbiased Ellenberg and Information Loss measures have been shown to produce better results than other measures [1].

In this paper, we describe our proposed Jaccard-NM [8] measure for binary features, and compare it with the Jaccard and Russell & Rao [9] measures. We present different cases to show deficiencies in the Jaccard measure which may deteriorate clustering results, and show how in these cases the Russell & Rao and Jaccard-NM measures are expected to have better performance. For non-binary features we propose the Unbiased Ellenberg-NM measure, and compare its performance with the Unbiased Ellenberg and Information Loss measures. We also analyze cases where these measures produce arbitrary decisions. We call a decision arbitrary when more than two entities have equal similarity value. In this situation, clustering algorithms arbitrarily select two entities to be clustered. This arbitrary decision may create problems [1].

Thus the contributions of this paper can be summarized as:
1) Analysis of the Jaccard, Jaccard-NM and Russell & Rao measures for binary features and a comparison of their strengths and weaknesses.
2) Definition of a new similarity measure for non-binary features and its comparison with well known existing measures used for software clustering.
3) Internal and external evaluation of clustering results. Internal assessment is carried out using arbitrary decisions taken by proposed and existing similarity measures. External assessment is carried out by comparing manually prepared software decompositions with automatically produced clustering results using MoJoFM.

This paper is organized as follows. In Section 2 we describe related work. An overview of clustering is presented in Section 3. In Section 4 we present an analysis of similarity measures for binary features and define a new measure for non-binary features. Section 5 describes our experimental setup. In Section 6, we analyze our experimental results. Finally, in Section 7, we present the conclusions and future work.

II. RELATED WORK
To find similarity between entities, various similarity measures have been used. Davey and Burd evaluated different similarity measures including the Jaccard, Sorensen-Dice, Canberra and Correlation coefficients [7]. From experimental results they concluded that the Jaccard and Sorensen-Dice similarity measures perform identically, and they recommended the Jaccard similarity measure for software clustering when features are binary.

Anquetil and Lethbridge compared different similarity measures including Jaccard, Simple Matching, Sorensen-Dice, Correlation, Taxonomic and Canberra [6]. For clustering they used the Complete linkage, Weighted linkage, Unweighted linkage and Single linkage algorithms. They concluded that the Jaccard and Sorensen-Dice similarity measures produce good results because they do not consider absence of a feature (d) as a sign of similarity, while Simple Matching and other similarity measures consider absence of a feature to be a sign of similarity and thus do not produce satisfactory results. In 2003 Anquetil and Lethbridge evaluated different features, similarity measures and clustering algorithms. From experimental results they once again concluded that the Jaccard similarity measure produces good results [10]. Saeed et al. developed a new linkage algorithm called the Combined algorithm [11]. They compared this algorithm with Complete linkage using different similarity measures including Jaccard, Sorensen-Dice, Simple Matching and the Correlation coefficient. They concluded that the behavior of the Correlation coefficient is similar to that of the Jaccard similarity measure when the number of absent features is very large compared to the number of present features. In 2004, Maqbool and Babri developed the Weighted Combined algorithm, and proposed the Unbiased Ellenberg similarity measure [12]. In that paper they evaluated the Complete linkage, Combined and Weighted Combined algorithms using the Jaccard, Euclidean distance, Pearson correlation coefficient, Ellenberg and Unbiased Ellenberg similarity measures. Their results suggested that the Weighted Combined algorithm produces better results than the Complete and Combined algorithms, especially with the Unbiased Ellenberg measure. Andritsos and Tzerpos developed an algorithm called LIMBO (scaLable InforMation BOttleneck algorithm) in 2005 [2].
They applied LIMBO to three different data sets and compared the results with ACDC, NAHC-lib, SAHC, SAHClib, Single Linkage, Complete Linkage, Weighted Average Linkage and Unweighted Average Linkage algorithms. They concluded that, on an average, LIMBO performed better than other algorithms. In 2006, Mitchell and Mancoridis described their Bunch clustering tool which uses search techniques (hill-climbing and genetic algorithms) to find optimal solutions [13]. Bunch tool was developed in 1998 [14] and over time modified to include new features (e.g omnipresent modules detection and deletion) [15], [16]. Bunch tool uses Module Dependency Graph (MDG), where modules are entities and edges are static relationships among entities. Bunch makes partitions of MDG and uses a fitness function Modularization Quality (MQ), to calculate the quality of graph partitions. Harman et al. investigated the effect of noise in input information available for software module clustering [17]. To guide the search they examine two fitness functions: Modularization Quality (MQ) and Evaluation Metric function (EVM). For evaluation purpose they used six real software systems, three Perfect module dependency graphs and three Random module dependency graphs and concluded that in the presence of noise EVM performs better than MQ for real and perfect MDGs. Results also show that EVM is more robust than MQ for smaller software systems. In 2010 Naseem et al. proposed a new similarity measure called Jaccard-NM [8] for binary features. They evaluated this measure using Complete linkage, Weighted average and Unweighted average. From the experimental results they concluded that, in general, Jaccard-NM produces better results than Jaccard similarity measure for binary features. Besides the software domain, a comparison of similarity measures has also been carried out in other domains. 
In these domains, the Jaccard measure does not necessarily perform better than other measures, as it does in the case of software, due to different domain characteristics. For example, Willett used thirteen similarity measures including Tanimoto, Russell & Rao, and Simple Matching to find the similarity between molecular fingerprints for virtual screening [18]. He concluded that the Tanimoto, Baroni-Urbani/Buser, Kulczynski(2), Fossum and Ochiai/Cosine coefficients perform reasonably well across the range of molecular size [18]. Dalirsefat et al. compared three similarity measures, Jaccard, Sorensen-Dice and Simple Matching, to find the similarity between biological organisms. They concluded from their experimental results that when the organisms are closely related, Jaccard or Sorensen-Dice give satisfactory results [19]. Moreover, Jaccard and Sorensen-Dice produce closely similar results because these two measures exclude negative co-occurrences. These results are similar to those obtained for software.

III. OVERVIEW OF CLUSTERING
In the clustering process, entities are grouped together based on their features. In this section, we provide an overview of the steps in clustering.

A. Selection of Entities and Features
Selection of entities and features depends on the type of software system, and also on the required architectural view. For modularization of structured software systems, researchers have selected different entities, e.g. files, processes and functions. Features may be global variables or user defined types used by an entity [6]. For object oriented software systems, entities may be classes [20], and features are typically defined by the relationships between classes, e.g. inheritance or containment. In the software domain, features are usually binary, i.e. they indicate the presence or absence of a characteristic or relationship. Before applying a clustering algorithm, a software system must be parsed to extract entities and features.
The result is an N x P matrix, where N is the number of entities and P is the number of features. Table I presents an N x P matrix of a small software system containing 4 entities and 6 binary features.

[Table I: (N x P) feature matrix for a small system; columns f1-f6, one row per entity]

B. Selection of Similarity Metrics
In the second step, a similarity measure is applied to compute similarity between every pair of entities, resulting in a similarity matrix. Selection of a similarity measure should be done carefully, because selecting an appropriate similarity measure may influence clustering results more than selection of a clustering algorithm [21]. Table II lists some well known similarity measures for binary features.

TABLE II
SIMILARITY MEASURES FOR BINARY FEATURES
S. No   Name              Mathematical representation
1       Jaccard           a/(a + b + c)
2       Russell & Rao     a/(a + b + c + d)
3       Simple Matching   (a + d)/(a + b + c + d)
4       Sokal-Sneath      a/(a + 2(b + c))
5       Rogers-Tanimoto   (a + d)/(a + 2(b + c) + d)
6       Gower-Legendre    (a + d)/(a + 0.5(b + c) + d)

The Jaccard-NM measure for binary features proposed by us in [8] is given by:

$$\text{Jaccard-NM} = \frac{a}{2(a + b + c) + d} \tag{1}$$

In Table II and Equation 1, a, b, c and d can be determined using Table III. For two entities X and Y, a is the number of features that are present (1) in both entities X and Y, b is the number of features that are present in X but absent in Y, c is the number of features that are absent in X but present in Y, and d is the number of features that are absent (0) in both entities. n = a + b + c + d is the total number of features.

TABLE III
CONTINGENCY TABLE
                      Y: 1 (Presence)   Y: 0 (Absence)   Sum
X: 1 (Presence)       a                 b                a + b
X: 0 (Absence)        c                 d                c + d
Sum                   a + c             b + d            n = a + b + c + d

Table IV lists some well known similarity measures for non-binary features. In Table IV, since the features are non-binary, Ma represents the sum of features that are present in both entities X and Y, Mb represents the sum of features that are present in X but absent in Y, and Mc represents the sum of features that are absent in X but present in Y.

TABLE IV
SIMILARITY MEASURES FOR NON-BINARY FEATURES
S. No   Name                 Mathematical representation
1       Ellenberg            0.5 Ma/(0.5 Ma + Mb + Mc)
2       Unbiased Ellenberg   0.5 Ma/(0.5 Ma + b + c)
3       Gleason              Ma/(Ma + Mb + Mc)
4       Unbiased Gleason     Ma/(Ma + b + c)

In the software domain, it has been shown that the Jaccard measure produces better results than other measures for binary features [6], [7]. One reason for this is that it does not consider d (absence of a feature, or negative match) [11], [22]. It has been observed that in software clustering the features are asymmetric, i.e. the presence of a feature (1) has more weight than its absence (0). The absence of features does not indicate similarity between two entities; e.g. if two classes both do not use a variable, it does not mean that they are similar. For non-binary features, the counterpart of the Jaccard similarity measure, Unbiased Ellenberg, produces better results for software clustering [1], [12].

C. Application of a Clustering Algorithm
The next step is to apply a clustering algorithm. Clustering algorithms can be categorized as hierarchical or non-hierarchical. Agglomerative Hierarchical Clustering (AHC) algorithms are based on the bottom-up approach. In this approach, an algorithm considers entities as singleton clusters, and at every step clusters the two most similar entities together. At the end, the algorithm makes one large cluster which contains all entities. Although non-hierarchical algorithms have also been used in the software domain [23], [24], there are some advantages to using AHC algorithms. For example, no prior information about the number of clusters is needed. Moreover, the hierarchical structure of a software system is naturally represented through hierarchical algorithms. The disadvantage is that we have to select a cutoff point, which represents the number of steps after which to stop the algorithm.

Widely used agglomerative hierarchical algorithms for software architecture recovery are Complete Linkage (CL), Single Linkage (SL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL). When two entities are merged into a cluster, similarity between the newly formed cluster and other clusters/entities is calculated differently by these algorithms.
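The difference between the four linkage algorithms lies only in how the similarity to a newly merged cluster is updated, following the update formulas given in this section. A minimal Python sketch of those updates is shown here; the function name `updated_similarity` and its parameters are hypothetical and chosen for illustration only.

```python
def updated_similarity(sim_12, sim_13, size_2=1, size_3=1, linkage="CL"):
    """Similarity between E1 and the cluster E23 formed by merging E2 and E3.

    sim_12, sim_13: similarities of E1 to E2 and to E3 before the merge.
    size_2, size_3: number of entities in E2 and E3 (only used by UWAL).
    """
    if linkage == "CL":    # Complete Linkage: keep the smaller similarity
        return min(sim_12, sim_13)
    if linkage == "SL":    # Single Linkage: keep the larger similarity
        return max(sim_12, sim_13)
    if linkage == "WAL":   # Weighted Average Linkage: plain mean of the two
        return 0.5 * sim_12 + 0.5 * sim_13
    if linkage == "UWAL":  # Unweighted Average Linkage: mean weighted by size
        return (sim_12 * size_2 + sim_13 * size_3) / (size_2 + size_3)
    raise ValueError(f"unknown linkage: {linkage}")


if __name__ == "__main__":
    for name in ("CL", "SL", "WAL", "UWAL"):
        print(name, updated_similarity(0.2, 0.6, size_2=1, size_3=3, linkage=name))
```

With singleton clusters (both sizes equal to 1), the WAL and UWAL updates coincide; they diverge only once clusters of different sizes are merged.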
Suppose we have three entities E1, E2 and E3. Using these algorithms, similarity between E1 and the newly formed cluster E23 is calculated as [22]:

Complete Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \min(\mathrm{Similarity}(E_1, E_2), \mathrm{Similarity}(E_1, E_3))$$

Single Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \max(\mathrm{Similarity}(E_1, E_2), \mathrm{Similarity}(E_1, E_3))$$

Weighted Average Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \tfrac{1}{2}\,\mathrm{Similarity}(E_1, E_2) + \tfrac{1}{2}\,\mathrm{Similarity}(E_1, E_3)$$

Unweighted Average Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \frac{\mathrm{Similarity}(E_1, E_2) \cdot \mathrm{size}(E_2) + \mathrm{Similarity}(E_1, E_3) \cdot \mathrm{size}(E_3)}{\mathrm{size}(E_2) + \mathrm{size}(E_3)}$$

The Complete linkage algorithm supports formation of small but cohesive clusters, while the Single linkage algorithm makes large, non-cohesive but stable clusters. The results of the Weighted and Unweighted Average Linkage algorithms lie between these two.

Two recently proposed hierarchical algorithms for software clustering are the Weighted Combined Algorithm (WCA) [12] and LIMBO [2]. When two entities are merged into a cluster using linkage algorithms, information about the number of entities accessing a feature is lost [12]. WCA and LIMBO overcome this limitation of linkage algorithms by making a new feature vector for the newly formed cluster. This feature vector contains information about the number of entities accessing a feature. Unlike linkage algorithms, these algorithms update the feature matrix after every step. Suppose we have two entities E_i and E_j with normalized feature vectors f_i and f_j, respectively. The new feature vector f_ij is calculated for both algorithms as:

$$f_{ij} = \frac{f_i + f_j}{n_i + n_j} = \frac{f_{ik} + f_{jk}}{n_i + n_j}, \quad k = 1, 2, \ldots, p$$

The Information Loss (IL) measure is used with LIMBO to calculate the information loss between any two entities/clusters. Entities are chosen for grouping together into a new cluster when their IL is minimum. The IL, represented by δI, is briefly described below (for details and examples see [2]). Information loss is given as:

$$\delta I = [p(e_i) + p(e_j)] \cdot D_{JS}[f_i, f_j]$$

For each singleton entity, p(e_i) = p(e_j) = 1/n, where n is the total number of entities. D_JS is the Jensen-Shannon divergence, defined as follows:

$$D_{JS} = \frac{p(e_i)}{p(e_{ij})}\, D_{KL}[f_i \,\|\, f_{ij}] + \frac{p(e_j)}{p(e_{ij})}\, D_{KL}[f_j \,\|\, f_{ij}]$$

D_KL is the relative entropy (also called the Kullback-Leibler (KL) divergence), which measures the difference between two probability distributions, given as:

$$D_{KL}[f_i \,\|\, f_j] = \sum_{k=1}^{p} f_{ik} \log\left(\frac{f_{ik}}{f_{jk}}\right)$$

D. Evaluation of Results
In external assessment, the automatically prepared decompositions are compared with the decompositions prepared by human experts. For this purpose different measures may be used. A well known measure is MoJoFM [25], a recent version of MoJo [26]. MoJoFM is an external assessment measure which calculates the percentage of Move and Join operations needed to convert the decomposition produced by a clustering algorithm into an expert decomposition [25]. To compare the result A of our algorithm with expert decomposition B, we have:

$$\mathrm{MoJoFM}(M) = \left(1 - \frac{\mathrm{mno}(A, B)}{\max(\mathrm{mno}(\forall A, B))}\right) \times 100\% \tag{2}$$

where mno(A, B) is the minimum number of move and join operations needed to convert A to B, and max(mno(∀A, B)) is the maximum possible value of that minimum, taken over all decompositions A of the same entities. A higher MoJoFM value (up to 100%) denotes greater correspondence between the two decompositions and hence better results, while lower MoJoFM values (down to 0%) indicate that the decompositions are very different.

In internal assessment, some internal characteristic of clusters may be used to evaluate the quality of results. Arbitrary decisions represent an internal quality measure [1]. An arbitrary decision is taken by an algorithm when there is more than one maximum value for similarity between entities (or, for distance and information loss measures, more than one minimum value).

IV. AN ANALYSIS OF SIMILARITY MEASURES AND FEATURE VECTOR CASES
In this section, we analyze similarity measures for binary features and propose a new measure for non-binary features.

A. Analysis of similarity measures
As described in Section III-B, for software clustering, measures that do not contain d produce better results. This is because features in software are asymmetric, and a 1 and a 0 do not have equal weight. 0 indicates the absence of a feature, and hence d indicates that features are not being shared between entities.
For software, the absence of a feature in two entities does not indicate similarity. For example, if two classes do not access the same global function, it does not mean that the two classes are similar. To show that the presence of d in a measure does not necessarily deteriorate results, consider Table V, which shows 4 entities, E1-E4. E1 and E2 share two features, so the value of a is 2. Each of them accesses one feature that the other entity does not, so b = 1 and c = 1. E3 and E4 share three features, so a = 3. As with E1 and E2, each of them accesses one feature that the other entity does not, so b = 1 and c = 1, as shown in Figure 1.

[Table V: Software system A; entities E1-E4, features f1-f7]
[Table VI: Similarity matrix using Jaccard for software system A]

[Fig. 1: Relationships between entities in software system A]
[Table VII: Similarity matrix using Jaccard-NM for software system A]
[Table VIII: Similarity matrix using Russell & Rao for software system A]
[Table IX: Similarity matrix using Simple Matching for software system A]

The similarity matrix according to the Jaccard measure is given in Table VI. The similarity matrices according to the Jaccard-NM, Russell & Rao and Simple Matching measures (all of which contain d) are given in Table VII - Table IX. It can be seen from Table VI - Table VIII that the Jaccard, Jaccard-NM and Russell & Rao measures find E3 and E4 to be most similar. From Figure 1, it is clear that E3 and E4 should indeed be considered most similar. However, due to the presence of d in the numerator of the Simple Matching coefficient, it finds E1 & E2 and E3 & E4 to be equally similar, resulting in an arbitrary decision where either of these pairs may be grouped. From this example, it is clear that the significant factor is whether d is present in the numerator or the denominator of a measure. Its presence in the numerator deteriorates results (as for the Simple Matching coefficient). However, if it is present in the denominator only, it does not indicate similarity but is a useful indicator of the proportion of common and total features.

Consider the following two cases, which indicate how the presence of d in the Jaccard-NM and Russell & Rao measures may improve performance as compared to Jaccard.

- Case1: Value of a is different among entities, but similarity as per Jaccard is same.
An example feature matrix with 4 entities (E1-E4) and 4 features (f1-f4) of a software system B for this case is presented in Table X and shown in Figure 2. In this system the value of a is 2 for entities E1 and E2. For entities E3 and E4, the value of a is 4.

[Table X: Software system B; entities E1-E4, features f1-f4]
[Fig. 2: Relationships between entities in software system B]
The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XI - Table XIII. It can be seen from Table XI that using the Jaccard measure, the pairs E1, E2 and E3, E4 are found to be equally similar. It may be better to choose E3 and E4 for clustering rather than E1 and E2, as they share a larger number of features. Both Jaccard-NM and Russell & Rao find E3 and E4 to be more similar, so the arbitrary decision is avoided.

[Table XI: Similarity matrix using Jaccard for software system B]
[Table XII: Similarity matrix using Jaccard-NM for software system B]
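The Case1 values can be worked out by hand. The short derivation below is a sketch that assumes b = c = 0 for both pairs; this is not stated explicitly in the text but follows from n = 4, a = 4 for E3 and E4, and the two pairs having the same Jaccard value. Under that assumption d = n - a, so d = 2 for (E1, E2) and d = 0 for (E3, E4).

$$
\begin{aligned}
\text{Jaccard}(E_1,E_2) &= \frac{2}{2+0+0} = 1, &
\text{Jaccard}(E_3,E_4) &= \frac{4}{4+0+0} = 1 \\
\text{Jaccard-NM}(E_1,E_2) &= \frac{2}{2(2+0+0)+2} = \frac{1}{3}, &
\text{Jaccard-NM}(E_3,E_4) &= \frac{4}{2(4+0+0)+0} = \frac{1}{2} \\
\text{Russell \& Rao}(E_1,E_2) &= \frac{2}{2+0+0+2} = \frac{1}{2}, &
\text{Russell \& Rao}(E_3,E_4) &= \frac{4}{4+0+0+0} = 1
\end{aligned}
$$

Jaccard ties the two pairs, while both measures with d in the denominator rank (E3, E4) above (E1, E2).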

[Table XIII: Similarity matrix using Russell & Rao for software system B]

- Case2: Value of a is high among entities, but they are not completely similar.
An example feature matrix with 4 entities (E1-E4) and 9 features (f1-f9) of a software system C for this case is presented in Table XIV and Figure 3. The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XV - Table XVII. It can be seen that entities E1 and E2 are found to be most similar by Jaccard. However, Jaccard-NM and Russell & Rao find E3 and E4 to be most similar, which may be more appropriate.

[Table XIV: Software system C; entities E1-E4]
[Fig. 3: Relationships between entities in software system C]
[Table XV: Similarity matrix using Jaccard for software system C]
[Table XVI: Similarity matrix using Jaccard-NM for software system C]
[Table XVII: Similarity matrix using Russell & Rao for software system C]

Through case1 and case2, we have shown that both the Jaccard-NM and Russell & Rao measures are expected to provide better results than the Jaccard measure. The question arises as to why we need to define Jaccard-NM when Russell & Rao already exists. To answer this question, consider the following example:

- Case3: Value of a is the same but the values of b and c are not.
Consider Table XVIII and Figure 4, which show four entities (E1-E4) and five features (f1-f5). Both pairs of entities have the same value of a, equal to three, but entities E1 and E2 have b = 0 and c = 0, while E3 and E4 have b = 1 and c = 1.

[Table XVIII: Software system D; entities E1-E4, features f1-f5]
[Fig. 4: Relationships between entities in software system D]

The corresponding similarity matrices using the Russell & Rao and Jaccard-NM measures are given in Table XIX and Table XX respectively. It can be seen from Table XIX that Russell & Rao results in arbitrary decisions among all entities.
But it can be seen from Table XX that Jaccard-NM reduces arbitrary decisions and gives preference to E1 and E2 to form a cluster in the first step. Hence in certain cases the results of Jaccard-NM and Russell & Rao are different, with Jaccard-NM reducing the arbitrary decisions which have a negative impact on the clustering results.
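The behaviour described for software systems A and D can be checked numerically. The following is a minimal Python sketch that uses only the contingency counts stated in the text (System A: 7 features, a = 2 or 3 with b = c = 1; System D: 5 features, a = 3 with b = c = 0 or b = c = 1). The function and variable names are illustrative assumptions, and d is derived as n - (a + b + c). Exact fractions are used so that ties are detected without floating-point noise.

```python
from fractions import Fraction


def with_d(a, b, c, n):
    """Return (a, b, c, d), deriving d = n - (a + b + c)."""
    return a, b, c, n - (a + b + c)


def jaccard(a, b, c, d):
    return Fraction(a, a + b + c)  # undefined if a + b + c == 0


def russell_rao(a, b, c, d):
    return Fraction(a, a + b + c + d)


def simple_matching(a, b, c, d):
    return Fraction(a + d, a + b + c + d)


def jaccard_nm(a, b, c, d):
    return Fraction(a, 2 * (a + b + c) + d)


# Counts taken from the text; names such as A12 are illustrative only.
A12 = with_d(2, 1, 1, 7)  # System A, pair (E1, E2)
A34 = with_d(3, 1, 1, 7)  # System A, pair (E3, E4)
D12 = with_d(3, 0, 0, 5)  # System D, pair (E1, E2)
D34 = with_d(3, 1, 1, 5)  # System D, pair (E3, E4)

for label, pair in [("A12", A12), ("A34", A34), ("D12", D12), ("D34", D34)]:
    print(label,
          "J =", jaccard(*pair),
          "JNM =", jaccard_nm(*pair),
          "RR =", russell_rao(*pair),
          "SM =", simple_matching(*pair))
```

For System A, Simple Matching ties the two pairs while the other three measures prefer (E3, E4). For System D, Russell & Rao ties the two pairs while Jaccard-NM prefers (E1, E2).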

[Table XIX: Similarity matrix using Russell & Rao for software system D]
[Table XX: Similarity matrix using Jaccard-NM for software system D]

B. Unbiased Ellenberg-NM - A new similarity measure for non-binary features
Unbiased Ellenberg is a Jaccard-like measure for non-binary features, as given in Equation 3.

$$\text{UnbiasedEllenberg} = \frac{0.5\,Ma}{0.5\,Ma + b + c} \tag{3}$$

The cases discussed in Section IV-A can also occur in a non-binary feature matrix. Therefore, to solve these problems, we propose a new measure called Unbiased Ellenberg-NM. Our new measure is defined as follows.

$$\text{UnbiasedEllenberg-NM} = \frac{0.5\,Ma}{0.5\,Ma + b + c + n} \tag{4}$$

$$= \frac{0.5\,Ma}{0.5\,Ma + b + c + (a + b + c + d)} \tag{5}$$

$$= \frac{0.5\,Ma}{0.5\,Ma + 2(b + c) + a + d} \tag{6}$$

V. EXPERIMENTAL SETUP
In this section, we describe the test systems and clustering setup for our experiments.

A. The Test Systems
To conduct clustering experiments, we selected three object oriented software systems which have been developed in Visual C++ [20]. These are proprietary software systems that run under Windows platforms. Statistical Analysis Visualization Tool (SAVT) is an application which provides functionality related to statistical data and result visualization. Printer Language Converter (PLC) is part of another system, and provides conversion of intermediate data structures to printer language. Print Language Parser (PLP) is a parser of a well known printer language. It transforms plain text and stores output in intermediate data structures. A brief description is given in Table XXI.

[Table XXI: Brief description of data sets; columns PLP, SAVT, PLC; rows: 1 Total number of source code lines, 2 Total number of header (.h) files, 3 Total number of implementation (.cpp, .cxx) files, 4 Total number of classes]

B. Entities and Features
Since all systems are object-oriented, we selected class as an entity.
From the different relationships that exist between classes, we selected eleven sibling (indirect) relationships [20] listed in Table XXII, since the similarity measures listed in Table II can only be applied to indirect relationships. We used these relationships because they occur frequently within object-oriented systems.

TABLE XXII
INDIRECT RELATIONSHIPS BETWEEN CLASSES THAT WERE USED FOR EXPERIMENTS
Name                          Description
Same Inheritance Hierarchy    Two or more classes that are derived from the same class
Same Class Containment        Classes contain objects of the same class
Same Class in Methods         Classes contain objects of the same class declared in a method locally or as a parameter
Same Generic Class            Two classes are used as instantiating parameters to the same generic class
Same Generic Parameter        Two generic classes have the same class as their parameter
Same File                     The source code of two or more classes is written in the same file
Same Folder                   Two or more classes reside in the same folder
Same Global Function Access   Two or more classes access the same global functions
Same Macro Access             Two or more classes access the same macro
Same Global Variable Access   Two or more classes access the same global variable

C. Similarity Measures
To find similarity between entities having binary features, we selected the Jaccard, Jaccard-NM and Russell & Rao similarity measures. For non-binary features we selected the Unbiased Ellenberg and Information Loss measures and compared their results with our newly proposed measure, Unbiased Ellenberg-NM.

D. Algorithms
To cluster the most similar entities we selected agglomerative clustering algorithms including Complete linkage, Weighted average and Unweighted average, described in Section III-C. We also selected the Weighted Combined Algorithm [12] and LIMBO [2].

E. Assessment
We obtained expert decompositions for each test system and compared our automatically produced clustering results with the expert decompositions at each step of hierarchical clustering using MoJoFM [25]. Results are reported by selecting the maximum MoJoFM value obtained during the clustering process. For internal assessment, the results obtained by the measures were evaluated by the number of arbitrary decisions taken during the clustering process.

VI. EXPERIMENTAL RESULTS AND ANALYSIS
A. External evaluation of results for binary features
In this section, we present experimental results of the Complete Linkage (CL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL) algorithms using the Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures. Table XXIII and Figure 5 present the results of the comparison between the automatically obtained decomposition and the expert decomposition using MoJoFM. From Figure 5 one can see that in all data sets Jaccard-NM and Russell & Rao give results equal to or better than Jaccard for all algorithms. From Table XXIII and Figure 6 it can be seen that, on average, Jaccard-NM and Russell & Rao produce significantly better results than the Jaccard similarity measure for all linkage algorithms.

[Table XXIII: MoJoFM values of the Jaccard, Jaccard-NM and Russell & Rao measures for all data sets and linkage algorithms; columns J, JNM, RR for each of PLP, SAVT, PLC; rows CL, UWAL, WAL, Average]
[Fig. 5: Experimental results using MoJoFM values for Complete (CL), Unweighted Average (UWAL) and Weighted Average (WAL) using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures]
[Fig. 6: Average MoJoFM using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)]

B. External evaluation of results for non-binary features
Figure 7 and Table XXIV show the results of applying the Weighted Combined algorithm using the Unbiased Ellenberg and Unbiased Ellenberg-NM measures, and LIMBO using the Information Loss measure. Figure 7 indicates that Unbiased Ellenberg-NM gives better results than the Unbiased Ellenberg and Information Loss measures. We analyze the reason for the better results of Unbiased Ellenberg-NM in the next section.

[Fig. 7: MoJoFM results for Weighted Combined (WC) using the Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) measures, and the Information Loss (IL) measure]
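To make the difference between the two non-binary measures concrete, here is a minimal Python sketch of Unbiased Ellenberg and Unbiased Ellenberg-NM. It rests on one assumption that the text does not spell out: Ma is computed as the sum of both entities' feature values over the features present in both. The function names and the example vectors are hypothetical and are not taken from the test systems.

```python
def counts(x, y):
    """Return (Ma, b, c, d) for two non-binary feature vectors of equal length.

    Assumption: Ma sums the values of both entities over shared features.
    b, c and d are plain counts, as in the binary contingency table.
    """
    ma = sum(xk + yk for xk, yk in zip(x, y) if xk > 0 and yk > 0)
    b = sum(1 for xk, yk in zip(x, y) if xk > 0 and yk == 0)
    c = sum(1 for xk, yk in zip(x, y) if xk == 0 and yk > 0)
    d = sum(1 for xk, yk in zip(x, y) if xk == 0 and yk == 0)
    return ma, b, c, d


def unbiased_ellenberg(x, y):
    ma, b, c, _ = counts(x, y)
    denom = 0.5 * ma + b + c
    return 0.5 * ma / denom if denom else 0.0  # 0.0 when nothing is present


def unbiased_ellenberg_nm(x, y):
    ma, b, c, d = counts(x, y)
    n = len(x)  # n = a + b + c + d, the total number of features
    return 0.5 * ma / (0.5 * ma + b + c + n)


# Hypothetical vectors: each pair is identical, but the second pair
# shares more features than the first.
e1, e2 = [2, 1, 0, 0], [2, 1, 0, 0]
e3, e4 = [2, 1, 1, 1], [2, 1, 1, 1]

print(unbiased_ellenberg(e1, e2), unbiased_ellenberg(e3, e4))
print(unbiased_ellenberg_nm(e1, e2), unbiased_ellenberg_nm(e3, e4))
```

In this example Unbiased Ellenberg assigns both pairs the same value, which would force an arbitrary decision, whereas Unbiased Ellenberg-NM ranks the pair with more shared features higher.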


[Table XXIV: Experimental results using MoJoFM values for Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) using the Weighted Combined algorithm, and LIMBO using the Information Loss (IL) measure, for all data sets; columns UE, UENM, IL for each of PLP, SAVT, PLC]
[Fig. 8: Average number of arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)]
[Fig. 9: Experimental results for arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)]

C. Internal evaluation using arbitrary decisions
Figure 8 presents the arbitrary decisions taken as a result of applying the Jaccard, Jaccard-NM and Russell & Rao measures throughout the clustering process for all test systems. We can see from Figure 9 that in the first thirteen steps of the clustering process for PLP, in the first quarter for SAVT and in the first half for PLC, the Jaccard similarity measure results in more arbitrary decisions than Jaccard-NM and Russell & Rao. This is due to entities which have a Jaccard similarity value equal to 1 while their values of a differ, which creates a large number of arbitrary decisions. This is case1, which we have defined and for which we proposed Jaccard-NM. In this case Russell & Rao also gives better results. It can be seen from Figure 9 that for PLP, the number of arbitrary decisions by Russell & Rao is higher than for Jaccard and Jaccard-NM. It can also be seen that for SAVT and PLC the behavior of Jaccard-NM and Russell & Rao is almost the same. This difference in the PLP data set is due to case3, defined in Section IV-A.

The average arbitrary decisions for the Unbiased Ellenberg, Unbiased Ellenberg-NM and Information Loss measures are presented in Figure 10. It was expected that the number of arbitrary decisions by Unbiased Ellenberg-NM would be less than for other similarity measures. The experimental results confirm our expectations.
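The notion of an arbitrary decision used in this section can be sketched in a few lines of Python: a clustering step is arbitrary when more than one pair of entities attains the maximum similarity. The function name `tied_best_pairs` is hypothetical, and the two matrices are illustrative values consistent with case1 under the assumption that E1 = E2 and E3 = E4; they are not measured data from the test systems.

```python
from itertools import combinations


def tied_best_pairs(sim, tol=1e-12):
    """Return all index pairs (i, j), i < j, that attain the maximum similarity.

    sim is a symmetric similarity matrix (list of lists) with at least two
    entities. More than one returned pair means the step is arbitrary.
    """
    pairs = list(combinations(range(len(sim)), 2))
    best = max(sim[i][j] for i, j in pairs)
    return [(i, j) for i, j in pairs if abs(sim[i][j] - best) <= tol]


# Illustrative Jaccard values: both (E1, E2) and (E3, E4) reach 1.0.
jaccard_b = [
    [1.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 1.0],
    [0.5, 0.5, 1.0, 1.0],
]

# Illustrative Jaccard-NM values for the same entities: (E3, E4) is the
# unique maximum. Diagonal entries are placeholders and are never read.
jaccard_nm_b = [
    [0.0, 1 / 3, 0.25, 0.25],
    [1 / 3, 0.0, 0.25, 0.25],
    [0.25, 0.25, 0.0, 0.5],
    [0.25, 0.25, 0.5, 0.0],
]

print(tied_best_pairs(jaccard_b))     # two candidate pairs: arbitrary
print(tied_best_pairs(jaccard_nm_b))  # one candidate pair: not arbitrary
```

Counting the steps for which this function returns more than one pair, over a full clustering run, gives the number of arbitrary decisions reported in the figures.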
We can see that Information Loss results in fewer arbitrary decisions while Unbiased Ellenberg results in more [1]. Moreover, our new measure Unbiased Ellenberg-NM results in fewer arbitrary decisions than Information Loss, thus producing the best clustering results.

Fig. 10. Average number of arbitrary decisions using Weighted Combined (WC) with Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM), and LIMBO using the Information Loss (IL) measure

Thus, from our analysis and experimental results, we conclude that:
• When the feature vector has d = 0, Jaccard-NM and Russell & Rao become equal to the Jaccard measure.
• Russell & Rao depends on a only.
• Jaccard-NM and Russell & Rao produce better clustering results than Jaccard by reducing arbitrary decisions.
• Unbiased Ellenberg-NM substantially decreases the number of arbitrary decisions compared to Unbiased Ellenberg and Information Loss for non-binary features, producing significantly better clustering results.

VII. CONCLUSIONS

Various binary and non-binary similarity measures have been used during clustering for software architecture recovery. Each measure has its own characteristics. Previous research suggests that similarity measures which do not consider the absence of features, d, perform well for software clustering, while those that include d do not. Amongst the measures not containing d, the Jaccard measure produces the best results. In this paper, we analyzed the performance of the Jaccard measure (which does not contain d) and of the Jaccard-NM and Russell & Rao measures (which contain d), using various cases that may arise in the feature matrix of a software system. We identified deficiencies of the Jaccard measure and showed how Jaccard-NM and Russell & Rao give better results than Jaccard. This is because they use d not to determine similarity, but to determine the proportion of common to total features. We also showed how Jaccard-NM is capable of reducing arbitrary decisions, which may be problematic during the clustering process. We also defined the non-binary counterpart of Jaccard-NM, Unbiased Ellenberg-NM, and compared its performance with the Unbiased Ellenberg and Information Loss measures. Similar to Jaccard-NM, it reduces arbitrary decisions and results in better clusters. In the future, it will be interesting to evaluate the performance of the Jaccard-NM, Russell & Rao and Unbiased Ellenberg-NM measures on other systems.

ACKNOWLEDGMENT

The authors would like to thank Mr. Abdul Qudus Abbasi for providing the software test systems.

REFERENCES

[1] O. Maqbool and H. A. Babri, "Hierarchical clustering for software architecture recovery," IEEE Trans. Software Eng., vol. 33, no. 11, pp , November
[2] P. Andritsos and V. Tzerpos, "Information theoretic software clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp , February
[3] C. Tjortjis, L. Sinos, and P. Layzell, "Facilitating program comprehension by mining association rules from source code," Proc. Int'l Workshop Program Comprehension, pp , May
[4] P. Tonella, "Concept analysis for module restructuring," IEEE Trans. Software Eng., vol. 27, pp , Apr
[5] M. Consens, A. Mendelzon, and A. Ryman, "Visualizing and querying software structures," Proc. Int'l Conf. Software Engineering (ICSE), vol. 133, pp , May
[6] N. Anquetil and T. C. Lethbridge, "Experiments with clustering as a software remodularization method," Proc. Working Conf. Reverse Eng. (WCRE), pp ,
[7] J. Davey and E. Burd, "Evaluating the suitability of data clustering for software remodularization," Proc. Working Conf. Reverse Eng., pp , November
[8] R. Naseem, O. Maqbool, and S. Muhammad, "An improved similarity measure for binary features in software clustering," Proc. Int'l Conf. Computational Intelligence, Modelling and Simulation (CIMSim), pp , September
[9] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp ,
[10] N. Anquetil and T. Lethbridge, "Comparative study of clustering algorithms and abstract representations for software remodularisation," Software, IEE Proceedings, vol. 150, no. 3, pp ,
[11] M. Saeed, O. Maqbool, H. A. Babri, S. Hassan, and S. Sarwar, "Software clustering techniques and the use of combined algorithm," Proc. Int'l Conf. Software Maintenance and Reeng., pp , March
[12] O. Maqbool and H. A. Babri, "The weighted combined algorithm: a linkage algorithm for software clustering," Proc. Int'l Conf. Software Maintenance and Reeng., pp ,
[13] B. S. Mitchell and S. Mancoridis, "On the automatic modularization of software systems using the Bunch tool," IEEE Trans. Software Eng., vol. 32, no. 3, pp , March
[14] S. Mancoridis, B. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner, "Using automatic clustering to produce high-level system organizations of source code," Proc. 6th Int'l Workshop on Program Comprehension, pp ,
[15] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner, "Bunch: A clustering tool for the recovery and maintenance of software system structures," IEEE Int'l Conf. Software Maintenance, p. 50,
[16] B. S. Mitchell and S. Mancoridis, "Using heuristic search techniques to extract design abstractions from source code," Proc. Genetic and Evolutionary Computation Conference, pp ,
[17] M. Harman, S. Swift, and K. Mahdavi, "An empirical study of the robustness of two module clustering fitness functions," Proc. Genetic and Evolutionary Computation Conference, pp , June
[18] P. Willett, "Similarity-based approaches to virtual screening," Biochemical Society Transactions, vol. 31, no. 3, pp , Jun
[19] S. Dalirsefat, A. da Silva Meyer, and S. Mirhoseini, "Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori," Journal of Insect Science, vol. 71, pp. 1-8,
[20] A. Q. Abbasi, "Application of appropriate machine learning techniques for automatic modularization of software systems," MPhil. thesis, Quaid-i-Azam University Islamabad,
[21] Z. Wen and V. Tzerpos, "Evaluating similarity measures for software decompositions," Proc. Int'l Conf. Software Maintenance, pp , September
[22] N. Anquetil, C. Fourier, and T. C. Lethbridge, "Experiments with hierarchical clustering algorithms as software remodularization methods," Proc. Working Conf. Reverse Eng.,
[23] Y. Kanellopoulos, P. Antonellis, C. Tjortjis, and C. Makris, "k-Attractors: A clustering algorithm for software measurement data analysis," Proc. 19th IEEE Int'l Conf. Tools with Artificial Intelligence, pp ,
[24] A. Lakhotia, "A unified framework for expressing software subsystem classification techniques," Journal of Systems and Software, vol. 36, pp ,
[25] Z. Wen and V. Tzerpos, "An effectiveness measure for algorithms," Proc. Int'l Workshop Program Comprehension, pp , June
[26] M. Shtern and V. Tzerpos, "A framework for the comparison of nested software decompositions," Proc. 11th IEEE Working Conf. Reverse Engineering, pp ,
