Improved Similarity Measures For Software Clustering


European Conference on Software Maintenance and Reengineering

Rashid Naseem, Onaiza Maqbool, Siraj Muhammad
Dept. of Computer Science, Quaid-I-Azam University, Islamabad
Elixir Technologies Pakistan (PVT) LTD

Abstract
Software clustering is a useful technique for recovering the architecture of a software system. The results of clustering depend upon the choice of entities, features, similarity measures and clustering algorithms. Different similarity measures have been used for determining similarity between entities during the clustering process. In the software architecture recovery domain, the Jaccard and the Unbiased Ellenberg measures have shown better results than other measures for binary and non-binary features respectively. In this paper we analyze the Russell and Rao measure for binary features to show the conditions under which its performance is expected to be better than that of Jaccard. We also show how our proposed Jaccard-NM measure is suitable for software clustering and propose its counterpart for non-binary features. Experimental results indicate that our proposed Jaccard-NM measure and the Russell & Rao measure perform better than the Jaccard measure for binary features, while for non-binary features the proposed Unbiased Ellenberg-NM measure produces results which are closer to the decomposition prepared by experts.

Index Terms: Software Clustering, Jaccard-NM Measure, Jaccard Measure, Unbiased Ellenberg-NM Measure, Russell & Rao Measure

I. INTRODUCTION
Software clustering has engaged the interest of researchers in the last two decades, primarily as a technique to facilitate understanding of legacy software systems. When the architectural documentation is not available, or the documentation has not been updated to reflect changes in the software over time, software clustering may be used for software modularization and architecture recovery [1], [2].
Besides clustering, other techniques used for this purpose are association rule mining [3], concept analysis [4] and graphical visualization [5].

The clustering process is used to modularize a software system or to recover sub-systems by grouping together software entities that are similar to each other. Thus entities within a cluster have similar characteristics or features, and are dissimilar from entities in other clusters. To determine similarity based on the features of an entity, a similarity measure is employed. Many different similarity measures are available. The choice of a measure depends on the characteristics of the domain in which it is applied. In the software domain, the most commonly used similarity measure for hierarchical clustering is the Jaccard coefficient for binary features [6], [7], while for non-binary features the Unbiased Ellenberg and Information Loss measures have been shown to produce better results than other measures [1].

In this paper, we describe our proposed Jaccard-NM [8] measure for binary features, and compare it with the Jaccard and Russell & Rao [9] measures. We present different cases to show deficiencies in the Jaccard measure which may deteriorate clustering results, and show how in these cases the Russell & Rao and Jaccard-NM measures are expected to perform better. For non-binary features we propose the Unbiased Ellenberg-NM measure, and compare its performance with the Unbiased Ellenberg and Information Loss measures. We also analyze cases where these measures produce arbitrary decisions. We call a decision arbitrary when more than two entities have an equal similarity value. In this situation, a clustering algorithm arbitrarily selects the two entities to be clustered. Such arbitrary decisions may create problems [1].

Thus the contributions of this paper can be summarized as:
1) Analysis of the Jaccard, Jaccard-NM and Russell & Rao measures for binary features and a comparison of their strengths and weaknesses.
2) Definition of a new similarity measure for non-binary features and its comparison with well known existing measures used for software clustering.
3) Internal and external evaluation of clustering results. Internal assessment is carried out using the arbitrary decisions taken by the proposed and existing similarity measures. External assessment is carried out by comparing manually prepared software decompositions with automatically produced clustering results using MoJoFM.

This paper is organized as follows. In Section II we describe related work. An overview of clustering is presented in Section III. In Section IV we present an analysis of similarity measures for binary features and define a new measure for non-binary features. Section V describes our experimental setup. In Section VI, we analyze our experimental results. Finally, in Section VII, we present the conclusions and future work.

II. RELATED WORK
To find similarity between entities, various similarity measures have been used. Davey and Burd evaluated different similarity measures including the Jaccard, Sorensen-Dice, Canberra and Correlation coefficients [7]. From experimental results they concluded that the Jaccard and Sorensen-Dice similarity measures perform identically, and they recommended the Jaccard similarity measure for software clustering when features are binary.

Anquetil and Lethbridge compared different similarity measures including Jaccard, Simple Matching, Sorensen-Dice, Correlation, Taxonomic and Canberra [6]. For clustering they used the Complete linkage, Weighted linkage, Unweighted linkage and Single linkage algorithms. They concluded that the Jaccard and Sorensen-Dice similarity measures produce good results because they do not consider the absence of a feature (d) as a sign of similarity, while Simple Matching and other similarity measures do consider the absence of a feature to be a sign of similarity and thus do not produce satisfactory results. In 2003 Anquetil and Lethbridge evaluated different features, similarity measures and clustering algorithms. From experimental results they once again concluded that the Jaccard similarity measure produces good results [10].

Saeed et al. developed a new linkage algorithm called the Combined algorithm [11]. They compared this algorithm with Complete linkage using different similarity measures including Jaccard, Sorensen-Dice, Simple Matching and the Correlation coefficient. They concluded that the behavior of the Correlation coefficient is similar to that of the Jaccard similarity measure when the number of absent features is very large compared to the number of present features.

In 2004, Maqbool and Babri developed the Weighted Combined algorithm, and proposed the Unbiased Ellenberg similarity measure [12]. In this paper they evaluated the Complete linkage, Combined and Weighted Combined algorithms using the Jaccard, Euclidean distance, Pearson correlation coefficient, Ellenberg and Unbiased Ellenberg similarity measures. Their results suggested that the Weighted Combined algorithm produces better results than the Complete and Combined algorithms, especially with the Unbiased Ellenberg measure.

Andritsos and Tzerpos developed an algorithm called LIMBO (scaLable InforMation BOttleneck algorithm) in 2005 [2].
They applied LIMBO to three different data sets and compared the results with the ACDC, NAHC-lib, SAHC, SAHC-lib, Single Linkage, Complete Linkage, Weighted Average Linkage and Unweighted Average Linkage algorithms. They concluded that, on average, LIMBO performed better than the other algorithms.

In 2006, Mitchell and Mancoridis described their Bunch clustering tool, which uses search techniques (hill-climbing and genetic algorithms) to find optimal solutions [13]. The Bunch tool was developed in 1998 [14] and modified over time to include new features (e.g. detection and deletion of omnipresent modules) [15], [16]. Bunch uses a Module Dependency Graph (MDG), in which modules are entities and edges are static relationships among entities. Bunch partitions the MDG and uses a fitness function, Modularization Quality (MQ), to calculate the quality of the graph partitions.

Harman et al. investigated the effect of noise in the input information available for software module clustering [17]. To guide the search they examined two fitness functions: Modularization Quality (MQ) and the Evaluation Metric function (EVM). For evaluation they used six real software systems, three perfect module dependency graphs and three random module dependency graphs, and concluded that in the presence of noise EVM performs better than MQ for real and perfect MDGs. Their results also show that EVM is more robust than MQ for smaller software systems.

In 2010 Naseem et al. proposed a new similarity measure called Jaccard-NM [8] for binary features. They evaluated this measure using the Complete linkage, Weighted average and Unweighted average algorithms. From the experimental results they concluded that, in general, Jaccard-NM produces better results than the Jaccard similarity measure for binary features.

Besides the software domain, a comparison of similarity measures has also been carried out in other domains.
In these domains, because of different domain characteristics, the Jaccard measure does not necessarily perform better than other measures as it does for software. For example, Willett used thirteen similarity measures including Tanimoto, Russell & Rao, and Simple Matching to find the similarity between molecular fingerprints for virtual screening [18]. He concluded that the Tanimoto, Baroni-Urbani/Buser, Kulczynski(2), Fossum and Ochiai/Cosine coefficients perform reasonably well across the range of molecular size [18]. Dalirsefat et al. compared three similarity measures, Jaccard, Sorensen-Dice and Simple Matching, to find the similarity between biological organisms. They concluded from their experimental results that when the organisms are closely related, Jaccard or Sorensen-Dice give satisfactory results [19]. Moreover, Jaccard and Sorensen-Dice produce closely similar results because these two measures exclude negative co-occurrences. These results are similar to those obtained for software.

III. OVERVIEW OF CLUSTERING
In the clustering process, entities are grouped together based on their features. In this section, we provide an overview of the steps in clustering.

A. Selection of Entities and Features
The selection of entities and features depends on the type of software system, and also on the required architectural view. For modularization of structured software systems, researchers have selected different entities, e.g. files, processes and functions. Features may be global variables or user defined types used by an entity [6]. For object oriented software systems, entities may be classes [20] and features are typically defined by the relationships between classes, e.g. inheritance or containment. In the software domain, features are usually binary, i.e. they indicate the presence or absence of a characteristic or relationship. Before applying a clustering algorithm, a software system must be parsed to extract entities and features.
The result is an N x P matrix, where N is the number of entities and P is the number of features. Table I presents an N x P matrix of a small software system containing 4 entities and 6 binary features.

TABLE I. (N x P) FEATURE MATRIX FOR A SMALL SYSTEM

B. Selection of Similarity Metrics
In the second step, a similarity measure is applied to compute the similarity between every pair of entities, resulting in a similarity matrix. The similarity measure should be selected carefully, because the choice of similarity measure may influence clustering results more than the choice of clustering algorithm [21]. Table II lists some well known similarity measures for binary features.

TABLE II. SIMILARITY MEASURES FOR BINARY FEATURES
S. No.  Name              Mathematical representation
1       Jaccard           a/(a + b + c)
2       Russell & Rao     a/(a + b + c + d)
3       Simple Matching   (a + d)/(a + b + c + d)
4       Sokal-Sneath      a/(a + 2(b + c))
5       Rogers-Tanimoto   (a + d)/(a + 2(b + c) + d)
6       Gower-Legendre    (a + d)/(a + 0.5(b + c) + d)

The Jaccard-NM measure for binary features proposed by us in [8] is given by:

$$\text{Jaccard-NM} = \frac{a}{2(a + b + c) + d} \qquad (1)$$

In Table II and Equation 1, a, b, c and d can be determined using Table III. For two entities X and Y, a is the number of features that are present (1) in both X and Y, b is the number of features that are present in X but absent in Y, c is the number of features that are absent in X but present in Y, and d is the number of features that are absent (0) in both entities. n = a + b + c + d is the total number of features.

TABLE III. CONTINGENCY TABLE
                      Y: 1 (Presence)   Y: 0 (Absence)   Sum
X: 1 (Presence)       a                 b                a + b
X: 0 (Absence)        c                 d                c + d
Sum                   a + c             b + d            n = a + b + c + d

Table IV lists some well known similarity measures for non-binary features. In Table IV, since the features are non-binary, Ma represents the sum of features that are present in both entities X and Y, Mb represents the sum of features that are present in X but absent in Y, and Mc represents the sum of features that are absent in X but present in Y.

TABLE IV. SIMILARITY MEASURES FOR NON-BINARY FEATURES
S. No.  Name                 Mathematical representation
1       Ellenberg            0.5 Ma/(0.5 Ma + Mb + Mc)
2       Unbiased Ellenberg   0.5 Ma/(0.5 Ma + b + c)
3       Gleason              Ma/(Ma + Mb + Mc)
4       Unbiased Gleason     Ma/(Ma + b + c)

In the software domain, it has been shown that the Jaccard measure produces better results than other measures for binary features [6], [7]. One reason for this is that it does not consider d (absence of a feature, or negative match) [11], [22]. It has been observed that in software clustering the features are asymmetric, i.e. the presence of a feature (1) has more weight than its absence (0). The absence of features does not indicate similarity between two entities; e.g. if two classes both do not use a variable, it does not mean that they are similar. For non-binary features, Unbiased Ellenberg, the counterpart of the Jaccard similarity measure, produces better results for software clustering [1], [12].

C. Application of a Clustering Algorithm
The next step is to apply a clustering algorithm. Clustering algorithms can be categorized as hierarchical or non-hierarchical. Agglomerative Hierarchical Clustering (AHC) algorithms are based on the bottom-up approach. In this approach, an algorithm starts by treating each entity as a singleton cluster, and at every step merges the two most similar clusters. At the end, the algorithm produces one large cluster which contains all entities. Although non-hierarchical algorithms have also been used in the software domain [23], [24], there are some advantages to using AHC algorithms. For example, no prior information about the number of clusters is needed. Moreover, the hierarchical structure of a software system is naturally represented through hierarchical algorithms. The disadvantage is that we have to select a cutoff point, which represents the number of steps after which to stop the algorithm.

Widely used agglomerative hierarchical algorithms for software architecture recovery are Complete Linkage (CL), Single Linkage (SL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL). When two entities are merged into a cluster, the similarity between the newly formed cluster and the other clusters/entities is calculated differently by these algorithms.
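The four linkage algorithms differ only in how they combine the two existing similarity values once a cluster has been formed. The following is a minimal sketch of that update step, not the authors' implementation; the function names and the example values are hypothetical and chosen only for illustration.

```python
def complete_linkage(sim_12: float, sim_13: float) -> float:
    """Similarity(E1, E23) under CL: the smaller of the two similarities."""
    return min(sim_12, sim_13)


def single_linkage(sim_12: float, sim_13: float) -> float:
    """Similarity(E1, E23) under SL: the larger of the two similarities."""
    return max(sim_12, sim_13)


def weighted_average_linkage(sim_12: float, sim_13: float) -> float:
    """Similarity(E1, E23) under WAL: each merged part counts one half."""
    return 0.5 * sim_12 + 0.5 * sim_13


def unweighted_average_linkage(sim_12: float, sim_13: float,
                               size_2: int, size_3: int) -> float:
    """Similarity(E1, E23) under UWAL: average weighted by cluster sizes."""
    return (sim_12 * size_2 + sim_13 * size_3) / (size_2 + size_3)


if __name__ == "__main__":
    # Hypothetical inputs: E1 is 0.2 similar to E2 (a singleton) and
    # 0.6 similar to E3 (a cluster that already holds three entities).
    s12, s13 = 0.2, 0.6
    print("CL  ", complete_linkage(s12, s13))
    print("SL  ", single_linkage(s12, s13))
    print("WAL ", weighted_average_linkage(s12, s13))
    print("UWAL", unweighted_average_linkage(s12, s13, 1, 3))
```

With these inputs CL takes the pessimistic value and SL the optimistic one, while the two average linkages fall in between; UWAL leans toward the larger cluster because it weights by size.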
Suppose we have three entities E1, E2 and E3. Using these algorithms, the similarity between E1 and the newly formed cluster E23 is calculated as [22]:

Complete Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \min(\mathrm{Similarity}(E_1, E_2), \mathrm{Similarity}(E_1, E_3))$$

Single Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \max(\mathrm{Similarity}(E_1, E_2), \mathrm{Similarity}(E_1, E_3))$$

Weighted Average Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \tfrac{1}{2}\,\mathrm{Similarity}(E_1, E_2) + \tfrac{1}{2}\,\mathrm{Similarity}(E_1, E_3)$$

Unweighted Average Linkage:
$$\mathrm{Similarity}(E_1, E_{23}) = \frac{\mathrm{Similarity}(E_1, E_2) \cdot \mathrm{size}(E_2) + \mathrm{Similarity}(E_1, E_3) \cdot \mathrm{size}(E_3)}{\mathrm{size}(E_2) + \mathrm{size}(E_3)}$$

The Complete linkage algorithm supports the formation of small but cohesive clusters, while the Single linkage algorithm makes large, non-cohesive but stable clusters. The results of the Weighted and Unweighted Average Linkage algorithms lie between these two.

Two recently proposed hierarchical algorithms for software clustering are the Weighted Combined Algorithm (WCA) [12] and LIMBO [2]. When two entities are merged into a cluster using linkage algorithms, information about the number of entities accessing a feature is lost [12]. WCA and LIMBO overcome this limitation by making a new feature vector for the newly formed cluster. This feature vector contains information about the number of entities accessing a feature. Unlike linkage algorithms, these algorithms update the feature matrix after every step. Suppose we have two entities E1 and E2 with normalized feature vectors f_i and f_j, respectively. The new feature vector f_ij is calculated for both algorithms as:

$$f_{ij} = \frac{f_i + f_j}{n_i + n_j} = \frac{f_{ik} + f_{jk}}{n_i + n_j}, \quad k = 1, 2, \ldots, p$$

The Information Loss (IL) measure is used with LIMBO to calculate the information loss between any two entities/clusters. Two entities are chosen for grouping into a new cluster when their IL is minimum. The IL, represented by δI, is briefly described below (for details and examples see [2]). Information loss is given as:

$$\delta I = [p(E_i) + p(E_j)] \cdot D_{JS}[f_i, f_j]$$

For each singleton entity, p(E_i) = p(E_j) = 1/n, where n is the total number of entities. D_JS is the Jensen-Shannon divergence, defined as follows:

$$D_{JS} = \frac{p(E_i)}{p(E_{ij})}\, D_{KL}[f_i \,\|\, f_{ij}] + \frac{p(E_j)}{p(E_{ij})}\, D_{KL}[f_j \,\|\, f_{ij}]$$

D_KL is the relative entropy (also called the Kullback-Leibler (KL) divergence), which measures the difference between two probability distributions and is given as:

$$D_{KL}[f_i \,\|\, f_j] = \sum_{k=1}^{p} f_{ik} \log\frac{f_{ik}}{f_{jk}}$$

D. Evaluation of Results
In external assessment, the automatically prepared decompositions are compared with the decompositions prepared by human experts. Different measures may be used for this purpose. A well known measure is MoJoFM [25], a recent version of MoJo [26]. MoJoFM is an external assessment measure which calculates the percentage of Move and Join operations needed to convert the decomposition produced by a clustering algorithm into an expert decomposition [25]. To compare the result A of our algorithm with the expert decomposition B, we have:

$$\mathrm{MoJoFM}(M) = \left(1 - \frac{mno(A, B)}{\max(mno(\forall A, B))}\right) \times 100 \qquad (2)$$

where mno(A, B) is the minimum number of move and join operations needed to convert A into B, and max(mno(∀A, B)) is the maximum possible number of move and join operations needed to convert any decomposition A into B. A MoJoFM value close to 100% denotes greater correspondence between the two decompositions and hence better results, while a value close to 0% indicates that the decompositions are very different.

In internal assessment, some internal characteristic of the clusters may be used to evaluate the quality of results. Arbitrary decisions represent an internal quality measure [1]. An arbitrary decision is taken by an algorithm when more than one pair attains the maximum similarity value (or, for distance and information loss measures, the minimum value).

IV. AN ANALYSIS OF SIMILARITY MEASURES AND FEATURE VECTOR CASES
In this section, we analyze similarity measures for binary features and propose a new measure for non-binary features.

A. Analysis of similarity measures
As described in Section III-B, for software clustering, measures that do not contain d produce better results. This is because features in software are asymmetric, and a 1 and a 0 do not have equal weight. A 0 indicates the absence of a feature, and hence d indicates that features are not being shared between entities.
For software, the absence of a feature in two entities does not indicate similarity. For example, if two classes do not access the same global function, it does not mean that the two classes are similar.

To show that the presence of d in a measure does not necessarily deteriorate results, consider Table V, which shows 4 entities, E1-E4. E1 and E2 share two features, so the value of a is 2. Each of them accesses one feature that the other does not, so b = 1 and c = 1. E3 and E4 share three features, so a = 3. As with E1 and E2, each of them accesses one feature that the other does not, so b = 1 and c = 1, as shown in Figure 1.

TABLE V. SOFTWARE SYSTEM A (entities E1-E4, features f1-f7)

TABLE VI. SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM A
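The counts stated above for System A are enough to evaluate each measure on the two pairs. The sketch below assumes n = 7 features (f1-f7), so that d follows as n - a - b - c; the function and variable names are hypothetical, and the formulas are those of Table II and Equation 1.

```python
def jaccard(a, b, c, d):
    # d is accepted but ignored so that all measures share one signature.
    # Undefined when a + b + c == 0 (no guard in this sketch).
    return a / (a + b + c)


def jaccard_nm(a, b, c, d):
    # Equation 1: d appears in the denominator only.
    return a / (2 * (a + b + c) + d)


def russell_rao(a, b, c, d):
    return a / (a + b + c + d)


def simple_matching(a, b, c, d):
    # d appears in the numerator, so joint absence counts as similarity.
    return (a + d) / (a + b + c + d)


def counts(a, b, c, n):
    """Complete the contingency counts by deriving d = n - a - b - c."""
    return a, b, c, n - a - b - c


N_FEATURES = 7                         # assumed: features f1..f7 of System A
pair_e1_e2 = counts(2, 1, 1, N_FEATURES)   # d = 3
pair_e3_e4 = counts(3, 1, 1, N_FEATURES)   # d = 2

if __name__ == "__main__":
    for measure in (jaccard, jaccard_nm, russell_rao, simple_matching):
        s12 = measure(*pair_e1_e2)
        s34 = measure(*pair_e3_e4)
        print(f"{measure.__name__:16s} E1-E2 = {s12:.3f}   E3-E4 = {s34:.3f}")
```

Under these assumptions Jaccard, Jaccard-NM and Russell & Rao all rank E3-E4 above E1-E2, while Simple Matching yields 5/7 for both pairs, which is the tie discussed in the text.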

Fig. 1. Relationships between entities in software system A

TABLE VII. SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM A

TABLE VIII. SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM A

TABLE IX. SIMILARITY MATRIX USING SIMPLE MATCHING FOR SOFTWARE SYSTEM A

The similarity matrix according to the Jaccard measure is given in Table VI. The similarity matrices according to the Jaccard-NM, Russell & Rao and Simple Matching measures (all of which contain d) are given in Table VII - Table IX. It can be seen from Table VI - Table VIII that the Jaccard, Jaccard-NM and Russell & Rao measures find E3 and E4 to be most similar. From Figure 1, it is clear that E3 and E4 should indeed be considered most similar. However, due to the presence of d in the numerator of the Simple Matching coefficient, it finds E1 & E2 and E3 & E4 to be equally similar, resulting in an arbitrary decision where either pair may be grouped.

From this example, it is clear that the significant factor is whether d is present in the numerator or the denominator of a measure. Its presence in the numerator deteriorates results (as for the Simple Matching coefficient). However, if it is present in the denominator only, it does not indicate similarity but is a useful indicator of the proportion of common and total features. Consider the following two cases, which indicate how the presence of d in the Jaccard-NM and Russell & Rao measures may improve performance as compared to Jaccard.

- Case 1: The value of a is different among entities, but the similarity as per Jaccard is the same.

TABLE X. SOFTWARE SYSTEM B (entities E1-E4, features f1-f4)

Fig. 2. Relationships between entities in software system B

An example feature matrix with 4 entities (E1-E4) and 4 features (f1-f4) of a software system B for this case is presented in Table X and shown in Figure 2. In this system the value of a is 2 for entities E1 and E2. For entities E3 and E4, the value of a is 4.
The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XI - Table XIII. It can be seen from Table XI that, using the Jaccard measure, the pairs E1 & E2 and E3 & E4 are found to be equally similar. It may be better to choose E3 and E4 for clustering rather than E1 and E2, as they share a larger number of features. Both Jaccard-NM and Russell & Rao find E3 and E4 to be more similar, so the arbitrary decision is avoided.

TABLE XI. SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM B

TABLE XII. SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM B
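An arbitrary decision can be detected mechanically as a tie at the maximum similarity. The sketch below illustrates this on Case 1 under a stated assumption: since Jaccard is reported as equal for both pairs and a differs, the counts are taken to be b = c = 0 for both pairs of System B (n = 4), which is not given explicitly in the text. Only the two pairs discussed are compared; the helper names are hypothetical.

```python
import math

# Measures expressed over the contingency counts (a, b, c, d).
MEASURES = {
    "Jaccard": lambda a, b, c, d: a / (a + b + c),
    "Jaccard-NM": lambda a, b, c, d: a / (2 * (a + b + c) + d),
    "Russell & Rao": lambda a, b, c, d: a / (a + b + c + d),
}

# Assumed counts for System B: four features, no mismatches in either pair.
PAIRS = {
    ("E1", "E2"): (2, 0, 0, 2),
    ("E3", "E4"): (4, 0, 0, 0),
}


def top_ties(similarities, tol=1e-12):
    """Return every pair whose similarity equals the maximum (within tol)."""
    best = max(similarities.values())
    return [pair for pair, value in similarities.items()
            if math.isclose(value, best, abs_tol=tol)]


def is_arbitrary(similarities):
    """True when more than one pair attains the maximum similarity."""
    return len(top_ties(similarities)) > 1


if __name__ == "__main__":
    for name, measure in MEASURES.items():
        sims = {pair: measure(*cnt) for pair, cnt in PAIRS.items()}
        print(f"{name:14s} arbitrary={is_arbitrary(sims)} best={top_ties(sims)}")
```

Under this assumption Jaccard scores both pairs 1.0 and is flagged as arbitrary, whereas Jaccard-NM and Russell & Rao single out E3-E4. A full implementation would evaluate all pairs of the similarity matrix at every clustering step.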

TABLE XIII. SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM B

- Case 2: The value of a is high among entities, but they are not completely similar.

An example feature matrix with 4 entities (E1-E4) and 9 features (f1-f9) of a software system C for this case is presented in Table XIV and Figure 3. The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XV - Table XVII. It can be seen that entities E1 and E2 are found to be most similar by Jaccard. However, Jaccard-NM and Russell & Rao find E3 and E4 to be most similar, which may be more appropriate.

TABLE XIV. SOFTWARE SYSTEM C

Fig. 3. Relationships between entities in software system C

TABLE XV. SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM C

TABLE XVI. SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM C

TABLE XVII. SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM C

Through Case 1 and Case 2, we have shown that both the Jaccard-NM and Russell & Rao measures are expected to provide better results than the Jaccard measure. The question arises as to why we need to define Jaccard-NM when Russell & Rao already exists. To answer this question, consider the following example:

- Case 3: The value of a is the same, but the values of b and c are not.

Consider Table XVIII and Figure 4, which show four entities (E1-E4) and five features (f1-f5). Both pairs have the same value of a, equal to three, but for E1 and E2 b = c = 0, while for E3 and E4 b = 1 and c = 1.

TABLE XVIII. SOFTWARE SYSTEM D (entities E1-E4, features f1-f5)

Fig. 4. Relationships between entities in software system D

The corresponding similarity matrices using the Russell & Rao and Jaccard-NM measures are given in Table XIX and Table XX respectively. It can be seen from Table XIX that Russell & Rao results in arbitrary decisions among all entities.
Table XX, however, shows that Jaccard-NM reduces arbitrary decisions and gives preference to E1 and E2 to form a cluster in the first step. Hence in certain cases the results of Jaccard-NM and Russell & Rao are different, with Jaccard-NM reducing the arbitrary decisions which have a negative impact on the clustering results.
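The arithmetic behind Case 3 can be written out for the two pairs described in the text. This is a worked sketch that assumes n = 5 features, so d = n - a - b - c gives d = 2 for E1 and E2 and d = 0 for E3 and E4; only these two pairs are shown.

```latex
% Russell & Rao: a / (a + b + c + d); the denominator is always n = 5
\mathrm{RR}(E_1, E_2) = \frac{3}{3 + 0 + 0 + 2} = \frac{3}{5} = 0.6
\qquad
\mathrm{RR}(E_3, E_4) = \frac{3}{3 + 1 + 1 + 0} = \frac{3}{5} = 0.6

% Jaccard-NM: a / (2(a + b + c) + d), Equation 1
\text{Jaccard-NM}(E_1, E_2) = \frac{3}{2(3 + 0 + 0) + 2} = \frac{3}{8} = 0.375
\qquad
\text{Jaccard-NM}(E_3, E_4) = \frac{3}{2(3 + 1 + 1) + 0} = \frac{3}{10} = 0.3
```

Because the Russell & Rao denominator is the constant n, equal values of a always tie, whereas the extra (a + b + c) term in Jaccard-NM penalizes the mismatches b and c and so prefers E1 and E2.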

TABLE XIX. SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM D

TABLE XX. SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM D

B. Unbiased Ellenberg-NM - A new similarity measure for non-binary features
Unbiased Ellenberg is a Jaccard-like measure for non-binary features, as given in Equation 3.

$$\text{UnbiasedEllenberg} = \frac{0.5\,Ma}{0.5\,Ma + b + c} \qquad (3)$$

The cases discussed in Section IV-A can also occur in a non-binary feature matrix. To solve these problems, we therefore propose a new measure called Unbiased Ellenberg-NM. Our new measure is defined as follows.

$$\text{UnbiasedEllenberg-NM} = \frac{0.5\,Ma}{0.5\,Ma + b + c + n} \qquad (4)$$

$$= \frac{0.5\,Ma}{0.5\,Ma + b + c + (a + b + c + d)} \qquad (5)$$

$$= \frac{0.5\,Ma}{0.5\,Ma + 2(b + c) + a + d} \qquad (6)$$

V. EXPERIMENTAL SETUP
In this section, we describe the test systems and clustering setup for our experiments.

A. The Test Systems
To conduct clustering experiments, we selected three object oriented software systems which have been developed in Visual C++ [20]. These are proprietary software systems that run on Windows platforms. Statistical Analysis Visualization Tool (SAVT) is an application which provides functionality related to statistical data and result visualization. Printer Language Converter (PLC) is part of another system, and converts intermediate data structures to a printer language. Print Language Parser (PLP) is a parser for a well known printer language. It transforms plain text and stores the output in intermediate data structures. A brief description is given in Table XXI.

TABLE XXI. BRIEF DESCRIPTION OF DATA SETS (columns: PLP, SAVT, PLC)
1. Total number of source code lines
2. Total number of header (.h) files
3. Total number of implementation (.cpp, .cxx) files
4. Total number of classes

B. Entities and Features
Since all systems are object-oriented, we selected the class as an entity.
From the different relationships that exist between classes, we selected eleven sibling (indirect) relationships [20], listed in Table XXII, since the similarity measures listed in Table II can only be applied to indirect relationships. We used these relationships because they occur frequently within object-oriented systems.

TABLE XXII. INDIRECT RELATIONSHIPS BETWEEN CLASSES THAT WERE USED FOR EXPERIMENTS
Name: Description
Same Inheritance Hierarchy: Two or more classes that are derived from the same class
Same Class Containment: Classes that contain objects of the same class
Same Class in Methods: Classes containing objects of the same class declared locally in a method or as a parameter
Same Generic Class: Two classes that are used as instantiating parameters to the same generic class
Same Generic Parameter: Two generic classes which have the same class as their parameter
Same File: The source code of two or more classes is written in the same file
Same Folder: Two or more classes reside in the same folder
Same Global Function Access: Two or more classes access the same global functions
Same Macro Access: Two or more classes access the same macro
Same Global Variable Access: Two or more classes access the same global variable

C. Similarity Measures
To find the similarity between entities having binary features we selected the Jaccard, Jaccard-NM and Russell & Rao similarity measures. For non-binary features we selected the Unbiased Ellenberg and Information Loss measures and compared their results with our newly proposed measure, Unbiased Ellenberg-NM.

D. Algorithms
To cluster the most similar entities we selected the agglomerative clustering algorithms Complete linkage, Weighted average and Unweighted average, described in Section III-C. We also selected the Weighted Combined Algorithm [12] and LIMBO [2].

E. Assessment
We obtained expert decompositions for each test system and compared our automatically produced clustering results with the expert decompositions at each step of hierarchical clustering using MoJoFM [25]. Results are reported by selecting the maximum MoJoFM value obtained during the clustering process. For internal assessment, the results obtained by the measures were evaluated by the number of arbitrary decisions taken during the clustering process.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

A. External evaluation of results for binary features
In this section, we present experimental results of the Complete Linkage (CL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL) algorithms using the Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures. Table XXIII and Figure 5 present the results of the comparison between the automatically obtained decompositions and the expert decompositions using MoJoFM. From Figure 5 one can see that in all data sets Jaccard-NM and Russell & Rao give results equal to or better than Jaccard for all algorithms. From Table XXIII and Figure 6 it can be seen that, on average, Jaccard-NM and Russell & Rao produce significantly better results than the Jaccard similarity measure for all linkage algorithms.

TABLE XXIII. MOJOFM VALUES OF JACCARD, JACCARD-NM AND RUSSELL & RAO MEASURES FOR ALL DATA SETS AND LINKAGE ALGORITHMS (rows: CL, UWAL, WAL, Average; columns: J, JNM, RR for each of PLP, SAVT, PLC)

Fig. 5. Experimental results using MoJoFM values for Complete (CL), Unweighted Average (UWAL) and Weighted Average (WAL) using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures

Fig. 6. Average MoJoFM using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

B. External evaluation of results for non-binary features
Figure 7 and Table XXIV show the results of applying the Weighted Combined algorithm using the Unbiased Ellenberg and Unbiased Ellenberg-NM measures, and LIMBO using the Information Loss measure. Figure 7 indicates that Unbiased Ellenberg-NM gives better results than the Unbiased Ellenberg and Information Loss measures. We analyze the reason for the better results of Unbiased Ellenberg-NM in the next section.

Fig. 7. MoJoFM results for Weighted Combined (WC) using Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) measures, and LIMBO using the Information Loss (IL) measure
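For reference, the two non-binary measures compared here (Equations 3 and 6) can be sketched as follows. The functions take Ma and the counts a, b, c, d as given inputs; the two example pairs are hypothetical values chosen to mimic Case 1 and are not taken from the test systems.

```python
def unbiased_ellenberg(ma: float, b: int, c: int) -> float:
    """Equation 3: 0.5*Ma / (0.5*Ma + b + c)."""
    return 0.5 * ma / (0.5 * ma + b + c)


def unbiased_ellenberg_nm(ma: float, a: int, b: int, c: int, d: int) -> float:
    """Equation 6: 0.5*Ma / (0.5*Ma + 2(b + c) + a + d)."""
    return 0.5 * ma / (0.5 * ma + 2 * (b + c) + a + d)


if __name__ == "__main__":
    # Hypothetical pairs over n = 4 features, both without mismatches:
    # pair P shares 2 features (Ma = 4), pair Q shares 4 features (Ma = 8).
    p = dict(ma=4.0, a=2, b=0, c=0, d=2)
    q = dict(ma=8.0, a=4, b=0, c=0, d=0)

    for label, pair in (("P", p), ("Q", q)):
        ue = unbiased_ellenberg(pair["ma"], pair["b"], pair["c"])
        uenm = unbiased_ellenberg_nm(**pair)
        print(f"pair {label}: UE = {ue:.3f}  UENM = {uenm:.3f}")
```

With these inputs Unbiased Ellenberg returns 1.0 for both pairs, a tie, while Unbiased Ellenberg-NM separates them in favour of the pair sharing more features, mirroring the behaviour of Jaccard versus Jaccard-NM in the binary case.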

TABLE XXIV. EXPERIMENTAL RESULTS USING MOJOFM VALUES FOR UNBIASED ELLENBERG (UE) AND UNBIASED ELLENBERG-NM (UENM) USING WEIGHTED COMBINED ALGORITHM AND LIMBO USING INFORMATION LOSS MEASURE FOR ALL DATA SETS (columns: UE, UENM, IL for each of PLP, SAVT, PLC)

C. Internal evaluation using arbitrary decisions
Figure 8 presents the arbitrary decisions taken as a result of applying the Jaccard, Jaccard-NM and Russell & Rao measures throughout the clustering process for all test systems. We can see from Figure 9 that in the first thirteen steps of the clustering process for PLP, in the first quarter for SAVT and in the first half for PLC, the Jaccard similarity measure results in more arbitrary decisions than Jaccard-NM and Russell & Rao. This is due to pairs of entities which have a Jaccard similarity value equal to 1 while their values of a differ; these create a large number of arbitrary decisions. This is Case 1, which we defined above and for which we proposed Jaccard-NM. In this case Russell & Rao also gives better results. It can be seen from Figure 9 that for PLP the number of arbitrary decisions taken by Russell & Rao is higher than for Jaccard and Jaccard-NM, while for SAVT and PLC the behavior of Jaccard-NM and Russell & Rao is almost the same. This difference in the PLP data set is due to Case 3, defined in Section IV-A.

Fig. 8. Average number of arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

Fig. 9. Experimental results for arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

The average arbitrary decisions for the Unbiased Ellenberg, Unbiased Ellenberg-NM and Information Loss measures are presented in Figure 10. It was expected that the number of arbitrary decisions taken by Unbiased Ellenberg-NM would be less than for the other similarity measures. The experimental results confirm our expectations.
Figure 10 shows that Information Loss results in fewer arbitrary decisions, while Unbiased Ellenberg results in more [1]. Moreover, our new measure Unbiased Ellenberg-NM results in fewer arbitrary decisions than Information Loss, thus producing the best clustering results.

Fig. 10. Average number of arbitrary decisions using Weighted Combined (WC) with Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM), and LIMBO using the Information Loss (IL) measure.

Thus from our analysis and experimental results we conclude that:
- When the feature vector has d = 0, Jaccard-NM and Russell & Rao become equal to the Jaccard measure.
- Russell & Rao depends on a only.
- Jaccard-NM and Russell & Rao produce better clustering results than Jaccard by reducing arbitrary decisions.
- Unbiased Ellenberg-NM substantially decreases the number of arbitrary decisions compared to Unbiased Ellenberg and Information Loss for non-binary features, producing significantly better clustering results.

VII. CONCLUSIONS

Various binary and non-binary similarity measures have been used during clustering for software architecture recovery. Each of the measures has its own characteristics. Previous research suggests that similarity measures which do not consider the absence of features, d, perform well for software clustering, and those that include d do not. Amongst the measures not containing d, the Jaccard measure produces the best results. In this paper, we analyzed the performance of the Jaccard measure (which does not contain d), and the Jaccard-NM and Russell & Rao measures (which contain d), using various cases that may arise in the feature matrix of a software system. We identified deficiencies of the Jaccard measure and showed how Jaccard-NM and Russell & Rao give better results than Jaccard. This is because they use d not to determine similarity, but to determine the proportion of common and total features. We also showed how Jaccard-NM is capable of reducing arbitrary decisions, which may be problematic during the clustering process. We also defined the non-binary counterpart of Jaccard-NM, the Unbiased Ellenberg-NM measure, and compared its performance with the Unbiased Ellenberg and Information Loss measures. Similar to Jaccard-NM, it reduces arbitrary decisions and results in better clusters. In the future, it will be interesting to evaluate the performance of the Jaccard-NM, Russell & Rao and Unbiased Ellenberg-NM measures on other systems.

ACKNOWLEDGMENT

The authors would like to thank Mr. Abdul Qudus Abbasi for providing the software test systems.

REFERENCES

[1] O. Maqbool and H. A. Babri, Hierarchical clustering for software architecture recovery, IEEE Trans. Software Eng., vol. 33, no. 11, November.
[2] P. Andritsos and V. Tzerpos, Information theoretic software clustering, IEEE Trans. Software Eng., vol. 31, no. 2, February.
[3] C. Tjortjis, L. Sinos, and P. Layzell, Facilitating program comprehension by mining association rules from source code, Proc. Int'l Workshop Program Comprehension, May.
[4] P. Tonella, Concept analysis for module restructuring, IEEE Trans. Software Eng., vol. 27, Apr.
[5] M. Consens, A. Mendelzon, and A. Ryman, Visualizing and querying software structures, Proc. of the Intl. Conference on Software Engineering (ICSE), vol. 133, May.
[6] N. Anquetil and T. C. Lethbridge, Experiments with clustering as a software remodularization method, Proc. Working Conference Reverse Engineering (WCRE).
[7] J. Davey and E. Burd, Evaluating the suitability of data clustering for software remodularization, Proc. Working Conf. Reverse Eng., November.
[8] R. Naseem, O. Maqbool, and S. Muhammad, An improved similarity measure for binary features in software clustering, Proc. of the Int'l. Conference on Computational Intelligence, Modelling and Simulation (CIMSim), September.
[9] S.-S. Choi, S.-H. Cha, and C. C. Tappert, A survey of binary similarity and distance measures, Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1.
[10] N. Anquetil and T. Lethbridge, Comparative study of clustering algorithms and abstract representations for software remodularisation, Software, IEE Proceedings, vol. 150, no. 3.
[11] M. Saeed, O. Maqbool, H. A. Babri, S. Hassan, and S. Sarwar, Software clustering techniques and the use of combined algorithm, Proc. Int'l Conf. Software Maintenance and Reeng., March.
[12] O. Maqbool and H. A. Babri, The weighted combined algorithm: a linkage algorithm for software clustering, Proc. Int'l Conf. Software Maintenance and Reeng.
[13] B. S. Mitchell and S. Mancoridis, On the automatic modularization of software systems using the Bunch tool, IEEE Trans. Software Eng., vol. 32, no. 3, March.
[14] S. Mancoridis, B. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner, Using automatic clustering to produce high-level system organizations of source code, Proc. 6th Intl. Workshop on Program Comprehension.
[15] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner, Bunch: A clustering tool for the recovery and maintenance of software system structures, IEEE Int'l. Conference on Software Maintenance, p. 50.
[16] B. S. Mitchell and S. Mancoridis, Using heuristic search techniques to extract design abstractions from source code, Proceedings of the Genetic and Evolutionary Computation Conference.
[17] M. Harman, S. Swift, and K. Mahdavi, An empirical study of the robustness of two module clustering fitness functions, Proc. Genetic and Evolutionary Computation Conference, June.
[18] P. Willett, Similarity-based approaches to virtual screening, Biochemical Society Transactions, vol. 31, no. 3, Jun.
[19] S. Dalirsefat, A. da Silva Meyer, and S. Mirhoseini, Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori, Journal of Insect Science, vol. 71, pp. 1-8.
[20] A. Q. Abbasi, Application of appropriate machine learning techniques for automatic modularization of software systems, MPhil. thesis, Quaid-i-Azam University, Islamabad.
[21] Z. Wen and V. Tzerpos, Evaluating similarity measures for software decompositions, Proc. Int'l Conf. Software Maintenance, September.
[22] N. Anquetil, C. Fourier, and T. C. Lethbridge, Experiments with hierarchical clustering algorithms as software remodularization methods, Proc. Working Conf. Reverse Eng.
[23] Y. Kanellopoulos, P. Antonellis, C. Tjortjis, and C. Makris, k-Attractors: A clustering algorithm for software measurement data analysis, Proc. 19th IEEE Int'l. Conference on Tools with Artificial Intelligence.
[24] A. Lakhotia, A unified framework for expressing software subsystem classification techniques, Journal of Systems and Software, vol. 36.
[25] Z. Wen and V. Tzerpos, An effectiveness measure for software clustering algorithms, Proc. Int'l Workshop Program Comprehension, June.
[26] M. Shtern and V. Tzerpos, A framework for the comparison of nested software decompositions, Proc. of the 11th IEEE Working Conf. Reverse Engineering.


More information

Exact Mixed Integer Programming for Integrated Scheduling and Process Planning in Flexible Environment

Exact Mixed Integer Programming for Integrated Scheduling and Process Planning in Flexible Environment Journal of Optimization in Industrial Engineering 15 (2014) 47-53 Exact ixed Integer Programming for Integrated Scheduling and Process Planning in Flexible Environment ohammad Saidi mehrabad a, Saeed Zarghami

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

The Concept of Geographic Relevance

The Concept of Geographic Relevance The Concept of Geographic Relevance Tumasch Reichenbacher, Stefano De Sabbata, Paul Crease University of Zurich Winterthurerstr. 190 8057 Zurich, Switzerland Keywords Geographic relevance, context INTRODUCTION

More information

Sales Analysis User Manual

Sales Analysis User Manual Sales Analysis User Manual Confidential Information This document contains proprietary and valuable, confidential trade secret information of APPX Software, Inc., Richmond, Virginia Notice of Authorship

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics... 1 1.1 Chemoinformatics... 2 1.1.1 Open-Source Tools... 2 1.1.2 Introduction to Programming Languages... 3 1.2 Chemical Structure

More information

Automated Statistical Recognition of Partial Discharges in Insulation Systems.

Automated Statistical Recognition of Partial Discharges in Insulation Systems. Automated Statistical Recognition of Partial Discharges in Insulation Systems. Massih-Reza AMINI, Patrick GALLINARI, Florence d ALCHE-BUC LIP6, Université Paris 6, 4 Place Jussieu, F-75252 Paris cedex

More information

Biology IA & IB Syllabus Mr. Johns/Room 2012/August,

Biology IA & IB Syllabus Mr. Johns/Room 2012/August, Biology IA & IB Syllabus Mr. Johns/Room 2012/August, 2017-2018 Description of Course: A study of the natural world centers on cellular structure and the processes of life. First semester topics include:

More information

MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS

MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS MATTHEW J. CRACKNELL BSc (Hons) ARC Centre of Excellence in Ore Deposits (CODES) School of Physical Sciences (Earth Sciences) Submitted

More information

An Empirical Study on the Developers Perception of Software Coupling

An Empirical Study on the Developers Perception of Software Coupling An Empirical Study on the Developers Perception of Software Coupling Gabriele Bavota 1, Bogdan Dit 2, Rocco Oliveto 3, Massimiliano Di Penta 4, Denys Poshyvanyk 2, Andrea De Lucia 1 1 University of Salerno,

More information

Contents. Chapter 1 Vector Spaces. Foreword... (vii) Message...(ix) Preface...(xi)

Contents. Chapter 1 Vector Spaces. Foreword... (vii) Message...(ix) Preface...(xi) (xiii) Contents Foreword... (vii) Message...(ix) Preface...(xi) Chapter 1 Vector Spaces Vector space... 1 General Properties of vector spaces... 5 Vector Subspaces... 7 Algebra of subspaces... 11 Linear

More information

Information Theory in Computer Vision and Pattern Recognition

Information Theory in Computer Vision and Pattern Recognition Francisco Escolano Pablo Suau Boyan Bonev Information Theory in Computer Vision and Pattern Recognition Foreword by Alan Yuille ~ Springer Contents 1 Introduction...............................................

More information

An Empirical-Bayes Score for Discrete Bayesian Networks

An Empirical-Bayes Score for Discrete Bayesian Networks An Empirical-Bayes Score for Discrete Bayesian Networks scutari@stats.ox.ac.uk Department of Statistics September 8, 2016 Bayesian Network Structure Learning Learning a BN B = (G, Θ) from a data set D

More information

Some New Information Inequalities Involving f-divergences

Some New Information Inequalities Involving f-divergences BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 12, No 2 Sofia 2012 Some New Information Inequalities Involving f-divergences Amit Srivastava Department of Mathematics, Jaypee

More information

An IDL Based Image Deconvolution Software Package

An IDL Based Image Deconvolution Software Package An IDL Based Image Deconvolution Software Package F. Városi and W. B. Landsman Hughes STX Co., Code 685, NASA/GSFC, Greenbelt, MD 20771 Abstract. Using the Interactive Data Language (IDL), we have implemented

More information

Relational Nonlinear FIR Filters. Ronald K. Pearson

Relational Nonlinear FIR Filters. Ronald K. Pearson Relational Nonlinear FIR Filters Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Thomas Jefferson University Philadelphia, PA Moncef Gabbouj Institute of Signal

More information

13 : Variational Inference: Loopy Belief Propagation

13 : Variational Inference: Loopy Belief Propagation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 13 : Variational Inference: Loopy Belief Propagation Lecturer: Eric P. Xing Scribes: Rajarshi Das, Zhengzhong Liu, Dishan Gupta 1 Introduction

More information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on

More information

Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory

Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory Bin Gao Tie-an Liu Wei-ing Ma Microsoft Research Asia 4F Sigma Center No. 49 hichun Road Beijing 00080

More information

Classification Based on Logical Concept Analysis

Classification Based on Logical Concept Analysis Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

Markov Chains and Spectral Clustering

Markov Chains and Spectral Clustering Markov Chains and Spectral Clustering Ning Liu 1,2 and William J. Stewart 1,3 1 Department of Computer Science North Carolina State University, Raleigh, NC 27695-8206, USA. 2 nliu@ncsu.edu, 3 billy@ncsu.edu

More information

Quantum Mechanics: Foundations and Applications

Quantum Mechanics: Foundations and Applications Arno Böhm Quantum Mechanics: Foundations and Applications Third Edition, Revised and Enlarged Prepared with Mark Loewe With 96 Illustrations Springer-Verlag New York Berlin Heidelberg London Paris Tokyo

More information