MATHEMATICAL AND EXPERIMENTAL INVESTIGATION OF ONTOLOGICAL SIMILARITY MEASURES AND THEIR USE IN BIOMEDICAL DOMAINS

ABSTRACT

by Xinran Yu

Similarity measurement is an important notion. In the context of ontologies, similarity measures are used to determine how similar one concept is to another. Because graph models have been used to represent ontologies, a variety of algorithms have been proposed for calculating the similarity between the graph nodes which represent ontological concepts. This thesis overviews existing ontological similarity measures and investigates a wide range of these measures both mathematically and experimentally. The objective is not to assess performance against a gold standard of similarity judgment but to develop a better understanding of the relationships among these measures by comparing their results when applied to the Gene Ontology. The experimental results show that some ontological similarity measures, especially information content-based measures, are highly correlated. The results of experiments comparing corpus-based to ontology-based information content measures for the Gene Ontology support previous experimental results using WordNet, which demonstrated little difference between the two approaches.

A Thesis Submitted to the Faculty of Miami University in partial fulfillment of the requirements for the degree of Master of Science

Department of Computer Science

by Xinran Yu

Miami University, Oxford, Ohio, 2010

Advisor: Valerie Cross, PhD.
Reader: Alton Sanders, PhD.
Reader: Eric Bachmann, PhD.

CONTENT

1. Introduction
Brief Historical Overview of Semantic (Ontological) Similarity
Overview of Standard Ontological Similarity Measures
Path-based or Edge Counting Ontological Similarity Measures
Information-content Ontological Similarity Measure
Tversky feature-based Ontological Similarity Measures
TaxPac Implementation of Standard Ontological Similarity Measures
More Recent Proposals for "Novel" Ontological Similarity Measures
Similarity between Biomedical Concepts
Semantic Relatedness Measure Using Object Properties in an Ontology
Plethora of Similarity Measures in Bioinformatics
Experimental Investigations of Ontological Similarity Measures
GO Description and its Concept Attributes
Structure Analysis of Gene Ontology
Experimental Investigation on IC and IC-based Ontological Similarity Measures
IC Experiments Using the GO
Experimental Investigation on GRASM method
Cellular Component Sub-Ontology Analysis using Average IC
Molecular Function Sub-Ontology Analysis
Biological Process Sub-Ontology Analysis
Experimental Investigation on Path Based Measures
Cellular Component Sub-Ontology Analysis
Molecular Function Sub-Ontology Analysis
Biological Process Sub-Ontology Analysis
Experimental Investigation on Set Based Measures
Correlations among different similarity measures in the three categories
A Classification of Ontological Similarity Measures
Conclusions and Future Work
References
Appendix

LIST OF TABLES

Table 5.1 CC Concept Attributes
Table 5.2 MF Concept Attributes
Table 5.3 BP Concept Attributes
Table 5.4 How Many Nodes Have the Following Number of Children
Table 5.5 How Many Nodes Have the Following Numbers of Parents
Table 5.6 Percentage of Descendents for Top 5 Children Concepts of The Root Concept
Table 5.7 Means and Standard Deviations for IC Values for CC Terms
Table 5.8 Pearson Correlation between IC Measures for CC Terms
Table 5.9 Spearman Correlation between IC Measures for CC Terms
Table Kendall Tau Correlation between IC Measures for CC Terms
Table Means and Standard Deviations for IC Values for MF Terms
Table Pearson Correlation between IC Measures for MF Terms
Table Spearman Correlation between IC Measures for MF Terms
Table Kendall Tau Correlation between IC Measures for MF Terms
Table Means and Standard Deviations for IC Values for BP Terms
Table Pearson Correlation between IC Measures for BP Terms
Table Spearman Correlation between IC Measures for BP Terms
Table Kendall Tau Correlation between IC Measures for BP Terms
Table Comparison of IC's Mean Values for CC MF and BP
Table GO Max Depth Statistics
Table Pearson Correlation Corpus IC to Ontology IC with k parameter
Table Spearman Correlation Corpus IC to Ontology IC with k parameter
Table Kendall Tau Correlation Corpus IC to Ontology IC with k parameter
Table Means and Standard Deviations for IC-Based Similarity Measures for CC Terms
Table Pearson Correlations for Ontological IC-based Similarity measures on CC
Table Pearson Correlations for Corpus IC-based Similarity Measures on CC
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for CC
Table Spearman Correlations for Ontological IC-based Similarity measures on CC
Table Spearman Correlations for Corpus IC-based Similarity measures on CC
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for CC
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on CC
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on CC

Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on CC
Table Means and Standard Deviations for IC ontological similarity measures for MF
Table Pearson Correlations for Ontological IC-based Similarity measures on MF
Table Pearson Correlations for Corpus IC-based Similarity measures on MF
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for MF
Table Spearman Correlations for Ontological IC-based Similarity measures on MF
Table Spearman Correlations for Corpus IC-based Similarity measures on MF
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for MF
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on MF
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on MF
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on MF
Table Means and Standard Deviations for IC-Based Similarity Measures for BP Terms
Table Pearson Correlations for Ontological IC-based Similarity measures on BP
Table Pearson Correlations for Corpus IC-based Similarity measures on BP
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for BP
Table Spearman Correlations for Ontological IC-based Similarity measures on BP
Table Spearman Correlations for Corpus IC-based Similarity measures on BP
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for BP
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on BP
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on BP
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on BP
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Mean and Standard Deviation for GRASM Similarity Measures on CC Terms
Table Pearson Correlations for GRASM Similarity measures on CC
Table Spearman Correlations for GRASM Similarity measures on CC
Table Kendall Tau Correlations for GRASM Similarity measures on CC
Table Mean and Standard Deviation for GRASM Similarity Measures on MF Terms
Table Pearson Correlations for GRASM Similarity measures on MF

Table Spearman Correlations for GRASM Similarity measures on MF
Table Kendall Tau Correlations for GRASM Similarity measures on MF
Table Mean and Standard Deviation for GRASM Similarity Measures on BP Terms
Table Pearson Correlations for GRASM Similarity measures on BP
Table Spearman Correlations for GRASM Similarity measures on BP
Table Kendall Tau Correlations for GRASM Similarity measures on BP
Table Means for each similarity measure to compare original to average IC
Table Pearson Correlations for Each Similarity measure and their corresponding average measures
Table Spearman Correlations for Each Similarity measure and their corresponding average measures
Table Kendall Correlations for Each Similarity measure and their corresponding average measures
Table 6.1 Means and Standard Deviations for Path-Based Similarity Measures for CC Terms
Table 6.2 Pearson Correlations for Path-based Similarity measures on CC
Table 6.3 Spearman Correlations for Path-based Similarity measures on CC
Table 6.4 Kendall Tau Correlations for Path-based Similarity measures on CC
Table 6.5 Means and Standard Deviations for Path-Based Similarity Measures for MF Terms
Table 6.6 Pearson Correlations for Path-based Similarity measures on MF
Table 6.7 Spearman Correlations for Path-based Similarity measures on MF
Table 6.8 Kendall Tau Correlations for Path-based Similarity measures on MF
Table 6.9 Means and Standard Deviations for Path-Based Similarity Measures for BP Terms
Table Pearson Correlations for Path-based Similarity measures on BP
Table Spearman Correlations for Path-based Similarity measures on BP
Table Kendall Tau Correlations for Path-based Similarity measures on BP
Table 7.1 Means and Standard Deviations for Set-Based Similarity Measures for CC Terms
Table 7.2 Pearson Correlations for Set-based Similarity measures on CC
Table 7.3 Spearman Correlations for Set-based Similarity measures on CC
Table 7.4 Kendall Tau Correlations for Set-based Similarity measures on CC

Table 7.5 Means and Standard Deviations for Set-Based Similarity Measures for MF Terms
Table 7.6 Pearson Correlations for Set-based Similarity measures on MF
Table 7.7 Spearman Correlations for Set-based Similarity measures on MF
Table 7.8 Kendall Tau Correlations for Set-based Similarity measures on MF
Table 7.9 Means and Standard Deviations for Set-Based Similarity Measures for BP Terms
Table Pearson Correlations for Set-based Similarity measures on BP
Table Spearman Correlations for Set-based Similarity measures on BP
Table Kendall Tau Correlations for Set-based Similarity measures on BP
Table 8.1 Means and Standard Deviations for Different Categories Similarity Measures for CC Terms
Table 8.2 Pearson Correlations for Different Categories Similarity measures on CC
Table 8.3 Spearman Correlations for Different Categories Similarity measures on CC
Table 8.4 Kendall Tau Correlations for Different Categories Similarity measures on CC
Table 8.5 Means and Standard Deviations for Different Categories Similarity Measures for MF Terms
Table 8.6 Pearson Correlations for Different Categories Similarity measures on MF
Table 8.7 Spearman Correlations for Different Categories Similarity measures on MF
Table 8.8 Kendall Tau Correlations for Different Categories Similarity measures on MF
Table 8.9 Means and Standard Deviations for Different Categories Similarity Measures for BP Terms
Table Pearson Correlations for Different Categories Similarity measures on BP
Table Spearman Correlations for Different Categories Similarity measures on BP
Table Kendall Tau Correlations for Different Categories Similarity measures on BP
Table 9.1 Property Analysis of Some Ontological Similarity Measures

LIST OF FIGURES

Figure 1 Illustration for Ontological Similarity Measure Examples
Figure 2 Portion of the GO
Figure 3 Example for Minimum Height and Minimum Depth
Figure 4 Union and Intersection of Ancestors and Descendents
Figure 5 Mean Values for Each Measure in CC, MF and BP
Figure 6 Pearson Correlations for Each Pair of Measures in CC, MF and BP
Figure 7 Spearman Correlations for Each Pair of Measures in CC, MF and BP
Figure 8 Kendall Correlations for Each Pair of Measures in CC, MF and BP
Figure 9 Classification of Semantic Similarity Measures [Pesquita et al. 2009]. MICA is most informative common ancestor; DCA is disjoint common ancestor
Figure 10 Classification Scheme for Ontological Similarity Measures

ACKNOWLEDGEMENTS

I would like to give my sincere thanks to Prof. Valerie Cross for her valuable advice, constant encouragement and technical help on my thesis. Without her guidance, I could not have finished this work. I also want to thank my thesis committee members, Prof. Alton Sanders and Prof. Eric Bachmann, for their guidance and support. I am also grateful to Dr. Joslyn and his lab at PNNL. Without his funding and help, as well as the TaxPac software from PNNL, I could not have done so well on my research and experiments. Finally, I would like to thank my mom for her support and care during the summer.

1. Introduction

We live in an information society, and information plays an important role almost everywhere and at any time. More and more emphasis is being placed on the semantics of the information which is made readily available on the World Wide Web. Ontologies are being used to represent the concepts and relationships among these concepts in a wide range of problem domains, particularly in bioinformatics and biomedical domains. The visual representation of such ontologies typically takes the form of a graph model where the nodes in the graph represent the concepts and the links between the nodes represent the relationships. Graph theory is being used to analyze and understand the structure of the ontologies.

Similarity measurement is an important notion which is used to compare two different objects to determine how well they agree or match each other. It is extremely important in information processing in many contexts, such as search engines, collaborative filtering, and clustering. In the context of ontologies, an important use of similarity measurement is to determine how similar one concept is to another. Because graph models have been used to represent ontologies, a variety of algorithms have been proposed for calculating the similarity between or among nodes within a graph where the nodes represent the ontological concepts.

Historically, the term semantic similarity measure has been used to refer to measures used to assess how similar one concept is to another concept. In the context of this thesis, an ontological similarity measure is a semantic similarity measure specific to assessing similarity between concepts within an ontology. Other semantic similarity measures have also been developed using dictionary-based [Kozima and Ito 1997] and thesaurus-based [Okumura and Honda 1994] approaches.

A wide variety of ontological similarity measures have been proposed. One of the earliest and simplest is a path-based measure [Rada 1989] that counts the number of edges between the two nodes representing the concepts to find the distance between these concepts. This distance can be converted to a similarity measure.

Others are more complex, such as the Wu & Palmer method [Wu and Palmer 1994], which takes into account the depth of the concepts' common ancestor, and Jiang & Conrath's method [Jiang and Conrath 1997], which considers the information content of the concept. Information content has been measured in different ways, such as by the concept's position or role in the graph [Seco et al. 2004] or by a usage analysis of the concept within a reference corpus [Resnik 1995]. For the most part, these algorithms have been developed relying on the graphical representation of an ontology.

As new ontological similarity measures have been proposed, evaluations of them against existing ones have typically used one of three approaches: mathematical analysis, domain-specific applications, and comparison to human judgments of similarity [Budanitsky 1999]. The primary approach, however, has been comparison to human judgments of similarity. More recently, due to the use of ontological similarity in bioinformatics, comparisons have shifted to other similarity measures for genes or gene products, such as sequence similarity [Lord et al. 2003a]. Few efforts have addressed the mathematical analysis of ontological similarity measures [Wei 1993] [Lin 1998] [Cross 2006]. Mathematical analysis explores a measure's mathematical properties to determine whether it is a true metric and how it relates mathematically to other existing measures.

This thesis research reviews existing ontological similarity methods found in the research literature, particularly in the biomedical and bioinformatics area due to the recent proliferation of these measures [Pesquita et al. 2009]. In particular, this research makes the following contributions:

1) analyze existing ontological similarity measures to discover their relationships and, if possible, an ordering for the results of these measures,
2) explore the role of aggregation of ontological similarity when it is used in measuring similarity between objects described using ontological concepts,
3) investigate in detail the overlap between recent "novel" measures and the older existing measures,
4) perform a controlled experimental evaluation of these measures using generated ontologies with different characteristics to explore how ontological structure affects a measure's performance, and
5) develop a formal framework for categorization of ontological similarity measures and compare this framework to the limited existing ontological similarity categorization frameworks.

The remainder of this thesis begins with Section 2, which provides a brief introduction to ontological similarity and its historical development, noting some of the standard ontological measures. Section 3 first provides the background for the standard historical ontological similarity measures. It then describes the work done to incorporate these into the TaxPac software being developed at Pacific Northwest National Laboratory [Joslyn et al. 2009], in order to illustrate the proliferation of these measures. More recent "novel" ontological similarity measures are discussed in Section 4. Sections 5, 6, 7 and 8 describe experiments designed to examine similarity measures using GO's three sub-ontologies. Section 9 presents the classification scheme, and Section 10 gives conclusions and future work.

2. Brief Historical Overview of Semantic (Ontological) Similarity

One of the earliest proposals for similarity between concepts in a semantic network simply determines the distance between the two concepts by counting the number of edges on the shortest path between them [Rada 1989]. This measure is considered a semantic distance. The use of the term "semantic" is a result of this distance being determined in a semantic network. The shorter this distance, the more semantically similar the two concepts are; that is, the semantic distance can be converted into a semantic similarity measure since similarity is considered an inverse function of distance.

This approach is criticized since it does not consider where the concepts occur in the semantic network. The criticism stems from the measure's failure to properly capture similarities in a network that is organized hierarchically. In this case, two concepts that are deeper in the hierarchy should be considered more similar than two concepts higher in the hierarchy, even though the same number of edges separates the two pairs of concepts. For example, if the network is a tree structure, the two nodes directly under the root should have less similarity than two leaves at a depth of 10 that are the same number of edges apart, because the upper nodes are more general and the lower ones are more specific. Considering only the distance between two concepts and ignoring the position of the concepts within the hierarchical structure is therefore not enough to determine their similarity.

Because of the weakness of Rada's simple edge counting measure, numerous researchers proposed methods to weight the edges in the network so that edges lower in the hierarchy were weighted less than edges higher up in the hierarchy [Lee et al. 1993] [Leacock and Chodorow 1998]. Another measure using edge count for distance scales the distance by the distance of the lowest common ancestor of the two concepts from the root of the hierarchy [Wu and Palmer 1994].

Besides edge counting methods, other researchers focus on determining similarity between two concepts based on the shared information content of the concepts. The first proposal using this approach utilizes the information content of the lowest common ancestor of the two concepts [Resnik 1995] to determine the amount of information shared between the two concepts. In this ontological similarity measure, information content for a concept is determined using the frequency of occurrence of the concept within a corpus to calculate a probability p(c) for that concept. Information content of a concept is then quantified as -log p(c). The similarity between two concepts is then given as a function of the information content of the most specific concept that subsumes both concepts. A criticism of this approach is that it only looks at the shared information content and does not consider the information content of the two concepts themselves. Other researchers create a measure that incorporates the individual information content of each concept into their ontological similarity [Lin 1998]. Another approach to determining information content avoids the need for an external corpus: the information content of a concept is specified as a function of the number of descendants of the concept [Seco et al. 2004].

Another approach to evaluating ontological similarity [Rodriguez and Egenhofer 2003] relies heavily on a psychological model of similarity, Tversky's parameterized ratio model of similarity [Tversky 1977]. In this model, the similarity between two objects is based on the ratio of the features they share to the total of their shared and non-shared features. In ontological similarity, the objects are the concepts, and a variety of characteristics of the concepts have been proposed as the features for measuring ontological similarity.

This section has provided an overview of standard ontological similarity measures in the context of their historical development but has not presented the details of their mathematical formalization. The next section gives examples of various ontological similarity measures along with their detailed formulas. The general categories are presented with at least one example in each category.

3. Overview of Standard Ontological Similarity Measures

As seen in the previous section, a variety of measures have been proposed to measure the similarity between two concepts in an ontology. Initially the two major categories of such measures were path-based or distance-based measures and information content measures. Later, interest returned to a feature-based or set-based approach to measuring concept similarity. The whole basis for this approach is Tversky's parameterized ratio model of similarity [Tversky 1977]. The following overview discusses some of the standard examples in these categories and is a summarization of some of the material in [Cross 2009]. Then implementations of some of these ontological measures in the TaxPac software are described. Discussion of the implementation of a Tversky generalization is included in section 6, which presents components of this thesis research. In the following sections Figure 1 is used to clarify the discussion of standard ontological similarity measures.

Figure 1 Illustration for Ontological Similarity Measure Examples

Assume an ontology contains the two concepts c1 and c2, for which distance and corresponding similarity are to be assessed. The concept c3 in the discussion represents a common subsumer or ancestor of c1 and c2 in the ontology that is typically selected to maximize the similarity between the two concepts.

3.1 Path-based or Edge Counting Ontological Similarity Measures

Path-based measures are also referred to as edge-counting or distance-based measures since they are determined primarily using a distance or a count of the number of edges that separate one concept in the ontology from another. One of the problems immediately noticed with such measures is that edges should not all represent the same uniform distance; that is, edges higher in an ontology represent greater distances than edges lower in the ontology. Also, these calculations represent distances, some of which are converted into similarity using various approaches. In this section we present the progression of standard distance-based ontological similarity measures. The simplest measure is Rada's edge counting distance measure [Rada 1989]

$$\mathrm{dist}_{R}(c_1, c_2) = \min_{p}\,\left[\mathrm{len}(p(c_1, c_2))\right] \qquad (3.1)$$

which produces the minimum length of all paths p between c1 and c2. Rada's measure was used in semantic networks and was not necessarily restricted to edges or links that represented is-a or hierarchical subsumption links. The obvious criticism was that edges were all weighted the same regardless of where they occurred in the semantic network, either high or low.

An approach to converting distance measures into a similarity measure is proposed by Leacock and Chodorow (1998) as

$$\mathrm{sim}_{LC}(c_1, c_2) = \max_{p}\,\left[-\log\frac{N_p}{2D}\right] \qquad (3.2)$$

where N_p is the number of nodes in path p from c1 to c2 and D is the maximum depth of the taxonomy.

Another approach to using path-based distances in the calculation of ontological similarity uses the distance of the common subsumer c3 from the root of the ontology [Wu and Palmer 1994] to express the intuition that the lower the concepts c1 and c2 are in the ontology, the greater the similarity of the concepts.

$$\mathrm{sim}_{WP}(c_1, c_2) = \frac{2\,\mathrm{len}(p(c_3, \mathrm{root}))}{\mathrm{len}(p(c_1, c_3)) + \mathrm{len}(p(c_2, c_3)) + 2\,\mathrm{len}(p(c_3, \mathrm{root}))} \qquad (3.3)$$

In this measure, c3 is typically selected as the lowest or deepest common subsumer in the ontology in order to maximize the numerator.
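As an illustration only, the three path-based measures can be sketched on a small is-a tree. The sketch assumes the usual forms of these measures: Rada's distance as the edge count through the lowest common subsumer c3, Leacock and Chodorow's similarity as -log(Np / 2D), and Wu and Palmer's similarity as 2 depth(c3) / (len(c1, c3) + len(c2, c3) + 2 depth(c3)). The toy tree and all function names are hypothetical and are not taken from the thesis or from TaxPac.

```python
import math

# Hypothetical toy is-a tree (not from the GO): child -> parent.
PARENT = {
    "root": None,
    "a": "root",
    "b": "root",
    "a1": "a",
    "a2": "a",
    "a11": "a1",
    "a12": "a1",
}


def path_to_root(c):
    """Return the concepts from c up to the root, inclusive."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path


def depth(c):
    """Number of edges between c and the root."""
    return len(path_to_root(c)) - 1


MAX_DEPTH = max(depth(c) for c in PARENT)  # D in the Leacock-Chodorow formula


def lowest_common_subsumer(c1, c2):
    """Deepest concept subsuming both c1 and c2 (c3 in the text)."""
    ancestors_c1 = set(path_to_root(c1))
    for c in path_to_root(c2):  # walks upward, so the first hit is the deepest
        if c in ancestors_c1:
            return c
    raise ValueError("concepts do not share a root")


def dist_rada(c1, c2):
    """Edge count of the shortest path, which in a tree runs through c3."""
    d3 = depth(lowest_common_subsumer(c1, c2))
    return (depth(c1) - d3) + (depth(c2) - d3)


def sim_lc(c1, c2, max_depth=MAX_DEPTH):
    """-log(Np / (2 D)), with Np counted in nodes (edges + 1)."""
    n_nodes = dist_rada(c1, c2) + 1
    return -math.log(n_nodes / (2.0 * max_depth))


def sim_wp(c1, c2):
    """2*depth(c3) / (len(c1,c3) + len(c2,c3) + 2*depth(c3))."""
    d3 = depth(lowest_common_subsumer(c1, c2))
    denominator = dist_rada(c1, c2) + 2 * d3
    # Returning 1.0 for the root compared with itself is an assumption of this sketch.
    return 2 * d3 / denominator if denominator else 1.0


if __name__ == "__main__":
    for pair in [("a", "b"), ("a11", "a12")]:
        print(pair, dist_rada(*pair), round(sim_lc(*pair), 3), round(sim_wp(*pair), 3))
```

In this toy tree the pairs (a, b) and (a11, a12) are both two edges apart, so the edge-counting distance and the Leacock and Chodorow similarity cannot tell them apart, while the Wu and Palmer similarity is higher for the deeper pair. This mirrors the criticism of uniform edge weights described above.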

These ontological similarity measures are the early standard path-based measures. More recent variations of these path-based measures are discussed in Section 4.

3.2 Information-content Ontological Similarity Measure

Path-based measures of ontological similarity are criticized for lacking an appropriate edge weighting mechanism that reflects how path distances between concepts at a higher level differ from those at a lower level. Instead, information content-based (IC) ontological measures rely on a measure of how informative concepts are within an ontology when determining how similar two concepts are. The earliest approach to determining information content is based on using an external resource such as an associated corpus for the problem domain [Resnik 1995]. Using standard information theory [Ross 1976], the information content of a concept c is given as

$$IC_{corpus}(c) = -\log p(c) \qquad (3.4)$$

with p(c) being the probability of the occurrence of an instance of concept c in a specified corpus. The value p(c) is based on the frequency of the concept. The frequency of a concept is the number of occurrences in the corpus of all words representing that concept. The frequency of the concept also includes the total frequencies of all its child concepts. The probability is calculated by dividing this total frequency count by the total number of words in the corpus. Because the formula is the negative logarithm of the probability, as the probability increases the information content decreases; therefore, concepts higher in the ontology, which have a greater probability of occurring, have less information content than those lower in the ontology.

Others have argued that instead of using an external resource for determining the information content of a concept, the ontology structure itself and a concept's position within that structure should be used (Seco, Veale, and Hayes 2004). The intuition is that leaf concepts in the ontology are the most specific and therefore contain the most information, while the root concept is the least specific and therefore contains the least information. They developed the following information content measure:

$$IC_{ont}(c) = \frac{\log\left(\frac{\mathrm{num\_desc}(c) + 1}{\mathrm{max}_{ont}}\right)}{\log\left(\frac{1}{\mathrm{max}_{ont}}\right)} = 1 - \frac{\log(\mathrm{num\_desc}(c) + 1)}{\log(\mathrm{max}_{ont})} \qquad (3.5)$$

where num_desc(c) is the number of descendants of concept c and max_ont is the maximum number of concepts in the ontology. This IC measure is normalized in that all leaf concepts have the maximum information content of 1. The information content decreases until the value is 0 for the root concept of the ontology.

A new IC method proposed by Wang et al. in 2009, called a weighted information content measure, incorporates not only the number of descendants a concept has but also the depth of the concept within the ontology:

$$IC_{ont\text{-}desc\text{-}depth}(c) = k\left(1 - \frac{\log(\mathrm{num\_desc}(c) + 1)}{\log(\mathrm{max}_{ont})}\right) + (1 - k)\,\frac{\log(\mathrm{depth}(c))}{\log(\mathrm{depth}_{max})} \qquad (3.6)$$

where depth(c) is the depth of concept c, depth_max is the maximum depth of the ontology, and k is a weighting parameter.

The first IC-based ontological similarity measure was proposed by Resnik (1995) as

$$\mathrm{sim}_{RES}(c_1, c_2) = \max_{c \in S(c_1, c_2)}\,\left[IC_{corpus}(c)\right] \qquad (3.7)$$

where S(c1, c2) is the set of concepts that subsume both c1 and c2. From Figure 1, assume that c3 is the concept which produces the maximum IC value. Basically, this measure examines all concepts that subsume both c1 and c2 and that have no descendants subsuming both c1 and c2, and uses the one that has the most information content, i.e., the most informative. A major criticism of Resnik's measure is that it only looks at shared information between the two concepts but does not incorporate the separate information content of the two concepts themselves. Lin (1998) defined another ontological measure to address this criticism:

$$\mathrm{sim}_{Lin}(c_1, c_2) = \frac{2\,IC(c_3)}{IC(c_1) + IC(c_2)} \qquad (3.8)$$

where c3 is the subsuming concept with the most information content. Note that IC is not subscripted since either an external resource such as a corpus or the ontological structure could be used to determine IC.

Jiang and Conrath (1997) define another distance measure between ontological concepts. Their objective is to integrate path-based measures and information content methods. Intuitively, the distance is based on totaling up the separate information contents of the two concepts and subtracting out twice the information content of their most informative subsumer.

$$\mathrm{dist}_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,IC(c_3) \qquad (3.9)$$

Whatever information content remains indicates the distance between them. If there is no IC left, i.e., 0, then the two concepts are the same. This distance measure can be converted to similarity, and several approaches have been proposed. For example, (Seco, Veale, and Hayes 2004) used the following:

$$\mathrm{sim}_{JC}(c_1, c_2) = 1 - 0.5\,\left(IC(c_1) + IC(c_2) - 2\,IC(c_3)\right) \qquad (3.10)$$

The relationship between the Lin ontological similarity measure and the Jiang and Conrath ontological distance measure can be seen if Lin's measure is converted into a distance by subtracting it from 1, since it is normalized to the [0, 1] range [Cross 2009]:

$$\mathrm{dist}_{Lin}(c_1, c_2) = 1 - \mathrm{sim}_{Lin}(c_1, c_2) = 1 - \frac{2\,IC(c_3)}{IC(c_1) + IC(c_2)} = \frac{IC(c_1) + IC(c_2) - 2\,IC(c_3)}{IC(c_1) + IC(c_2)} \qquad (3.11)$$

The dist_JC ontological distance measure is simply an unnormalized version of dist_Lin.
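The following is a minimal sketch, under stated assumptions, of the ontology-based information content of Seco et al. and of the Resnik, Lin, and Jiang and Conrath measures built on it. It assumes IC_ont(c) = 1 - log(num_desc(c) + 1) / log(max_ont), sim_Lin = 2 IC(c3) / (IC(c1) + IC(c2)), dist_JC = IC(c1) + IC(c2) - 2 IC(c3), and sim_JC = 1 - 0.5 dist_JC, and it plugs the ontology-based IC into Resnik's measure in place of a corpus-based IC. The toy tree and the function names are hypothetical.

```python
import math

# Hypothetical toy is-a tree (not from the GO): child -> parent.
PARENT = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}


def subsumers(c):
    """Return c together with all of its ancestors."""
    found = set()
    while c is not None:
        found.add(c)
        c = PARENT[c]
    return found


def num_desc(c):
    """Count the proper descendants of c."""
    return sum(1 for other in PARENT if other != c and c in subsumers(other))


def ic_ont(c):
    """Ontology-based IC: 1 - log(num_desc(c) + 1) / log(max_ont)."""
    return 1.0 - math.log(num_desc(c) + 1) / math.log(len(PARENT))


def most_informative_subsumer(c1, c2):
    """c3 in the text: the common subsumer with the highest IC."""
    return max(sorted(subsumers(c1) & subsumers(c2)), key=ic_ont)


def sim_resnik(c1, c2):
    """IC of the most informative common subsumer."""
    return ic_ont(most_informative_subsumer(c1, c2))


def sim_lin(c1, c2):
    """2*IC(c3) / (IC(c1) + IC(c2)); 0.0 if both ICs are 0 (sketch assumption)."""
    total = ic_ont(c1) + ic_ont(c2)
    return 2.0 * sim_resnik(c1, c2) / total if total else 0.0


def dist_jc(c1, c2):
    """IC(c1) + IC(c2) - 2*IC(c3)."""
    return ic_ont(c1) + ic_ont(c2) - 2.0 * sim_resnik(c1, c2)


def sim_jc(c1, c2):
    """1 - 0.5 * dist_JC, the conversion attributed to Seco et al."""
    return 1.0 - 0.5 * dist_jc(c1, c2)


if __name__ == "__main__":
    for c1, c2 in [("a1", "a2"), ("a1", "b"), ("a", "a1")]:
        lin = sim_lin(c1, c2)
        # dist_JC is dist_Lin = 1 - sim_Lin scaled by IC(c1) + IC(c2).
        scale = ic_ont(c1) + ic_ont(c2)
        assert abs(dist_jc(c1, c2) - (1.0 - lin) * scale) < 1e-12
        print(c1, c2, round(sim_resnik(c1, c2), 3), round(lin, 3), round(sim_jc(c1, c2), 3))
```

The assertion inside the loop checks numerically that the Jiang and Conrath distance equals the Lin distance multiplied by IC(c1) + IC(c2), which is the sense in which dist_JC is an unnormalized version of dist_Lin.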

3.3 Tversky feature-based Ontological Similarity Measures

How humans judge similarity is an active research area of psychology. One of the most famous models for similarity assessment is Tversky's parameterized ratio model of similarity [Tversky 1977]:

$$S_{Tversky}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X - Y| + \beta\,|Y - X|} \qquad (3.12)$$

With $\alpha = \beta = 1$, $S_{Tversky}$ becomes the Jaccard index:

$$S_{jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (3.11)$$

With $\alpha = \beta = 1/2$, $S_{Tversky}$ becomes Dice's coefficient of similarity:

$$S_{dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \qquad (3.13)$$

With $\alpha = 1$, $\beta = 0$, $S_{Tversky}$ becomes the degree of inclusion for X, that is, the proportion of X overlapping with Y:

$$S_{inclusion}(X, Y) = \frac{|X \cap Y|}{|X|} \qquad (3.14)$$

Similarly, with $\alpha = 0$, $\beta = 1$, $S_{Tversky}$ becomes the degree of inclusion for Y, the proportion of Y overlapping with X.

Using this model, researchers have begun looking at a concept in an ontology as an object with a set of features. There are a wide variety of "features" that may be selected to describe a concept within an ontology. For example, the set of features of a concept could be its set of ancestors. Then a natural ontological similarity measure between two concepts x and y and their respective sets of ancestors X and Y would be the application of Tversky's parameterized ratio model of similarity. Another set of features to describe a concept is its set of descendants. Tversky's model is especially flexible in that any set of features describing a concept can be used in determining its similarity to another concept.
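A short sketch of the ratio model on finite feature sets may make the special cases concrete. It assumes the set-cardinality form S(X, Y) = |X ∩ Y| / (|X ∩ Y| + α|X - Y| + β|Y - X|); the function names and the example ancestor sets are hypothetical.

```python
def tversky(x, y, alpha, beta):
    """Tversky ratio model on two finite feature sets x and y."""
    x, y = set(x), set(y)
    shared = len(x & y)
    denominator = shared + alpha * len(x - y) + beta * len(y - x)
    # Returning 0.0 for an empty denominator is an assumption of this sketch.
    return shared / denominator if denominator else 0.0


def jaccard(x, y):
    """Special case alpha = beta = 1."""
    return tversky(x, y, 1.0, 1.0)


def dice(x, y):
    """Special case alpha = beta = 1/2."""
    return tversky(x, y, 0.5, 0.5)


def inclusion(x, y):
    """Special case alpha = 1, beta = 0: proportion of x overlapping with y."""
    return tversky(x, y, 1.0, 0.0)


if __name__ == "__main__":
    # Hypothetical ancestor sets used as the feature sets of two concepts.
    ancestors_x = {"root", "a"}
    ancestors_y = {"root", "a", "a1"}
    print(jaccard(ancestors_x, ancestors_y))
    print(dice(ancestors_x, ancestors_y))
    print(inclusion(ancestors_x, ancestors_y), inclusion(ancestors_y, ancestors_x))
```

Because the two inclusion degrees differ whenever the feature sets differ in size, the model is symmetric only when α equals β, which is why the Jaccard and Dice cases are symmetric while the inclusion cases are not.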

Researchers [Rodriquez and Egenhofer 2003] applied Tversky's model repeatedly to define a similarity measure between two classes in an ontology. Their ontological similarity measure between entity classes c1 and c2 incorporates a weighted aggregation of Tversky's similarity measure over a wide range of feature sets of the concepts, including synonym sets, semantic neighborhoods, and distinguishing features. Distinguishing features are further classified into parts, functions, and attributes. The only difference from Tversky's model is that the α and β parameters are determined as a function of the depth of the two concepts.

3.4 TaxPac Implementation of Standard Ontological Similarity Measures

TaxPac stands for Taxonomy Package, an experimental mathematics environment for knowledge systems analysis being developed at PNNL [Joslyn and White 2009]. It is a platform available in Python, built as an extension of the NetworkX system for graph analysis developed by the Los Alamos National Laboratory. Its main goal is to use mathematical order theory to express and analyze knowledge bases which can be represented in various graph structures ranging from digraphs to concept lattices. As part of the PNNL contract work this past summer, this package was extended with the basic standard ontological similarity measures described in the previous sections. These measures are included in the class BoundedDAG, which stands for bounded directed acyclic graph. The implementation of these ontological measures took advantage of the existing TaxPac data structures and classes suitable for representing ontologies and the concepts within them. Using and further extending this TaxPac environment is planned in order to accomplish the goals of this thesis research, which will require experiments and analysis on a wide range of ontological similarity measures.
In the TaxPac environment, the implementation of all the standard path-based measures uses only the distance between two nodes c1 and c2 through a common subsumer c3. However, because there can be multiple common subsumers (each of c1 and c2 may have multiple parent nodes), the measures have been parameterized (min, max, ave) to allow selection of the minimum, maximum, or average over all the similarity values calculated using each of the common subsumers of c1 and c2. The standard ontological similarity measures assume that the common subsumer which maximizes the similarity measure should be the result. The following section discusses newer measures that consider different aggregation approaches; those approaches provided the motivation for this parameterization.

For the information-content-based measures, a major aspect is the method used to calculate the IC value: either an outside corpus is used to determine the IC value probabilistically and assign it as a node weight, or some other node metric or node-weighting scheme based on the structure of the ontology graph is used, such as the number of descendants of a node. TaxPac provides both edge-weighting and node-weighting capabilities. In the current implementation and testing, only the method proposed in [Seco et al 2004] has been tested within the TaxPac environment. Other node-weighting schemes that have been coded but not tested are presented in Section 6.

From the presentation on information-content-based measures, one can see that all of them essentially draw from the same components: IC(c1), IC(c2) and IC(c3). A standard parameterized method was created that allows the creation of any of the IC measures based on the components selected in the standard IC formula. As with the path-based measures, the key to the standard IC measures is the common subsumer c3.
Because there can be multiple common subsumers, the standard IC ontological similarity measures are also parameterized to allow selection of the minimum, maximum, or average of all the similarity values calculated using each of the common subsumers of c1 and c2. As with the path-based measures, the newer aggregation approaches discussed in the following section provide the motivation for this parameterization.
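The (min, max, ave) parameterization can be sketched in a few lines of Python. This is not the TaxPac API: the names `subsumers`, `lin`, and `aggregated_similarity`, the dictionary-based graph representation, and the IC values are all hypothetical, and a concept is assumed to count as a subsumer of itself.

```python
from statistics import mean

AGGREGATORS = {"min": min, "max": max, "ave": mean}


def subsumers(parents, node):
    """Return the node plus all of its ancestors in a DAG given as child -> parents."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin(ic1, ic2, ic3):
    """Lin similarity for one common subsumer c3; assumes IC(c1) + IC(c2) > 0."""
    return 2.0 * ic3 / (ic1 + ic2)


def aggregated_similarity(parents, ic, c1, c2, measure=lin, how="max"):
    """Evaluate `measure` once per common subsumer, then aggregate the values."""
    common = subsumers(parents, c1) & subsumers(parents, c2)
    if not common:
        raise ValueError("concepts share no subsumer; the DAG is not bounded")
    scores = [measure(ic[c1], ic[c2], ic[c3]) for c3 in common]
    return AGGREGATORS[how](scores)


# Hypothetical bounded DAG: x and y each have the two parents a and b.
parents = {"a": ["root"], "b": ["root"], "x": ["a", "b"], "y": ["a", "b"]}
ic = {"root": 0.0, "a": 0.3, "b": 0.5, "x": 0.9, "y": 0.8}
for how in ("min", "max", "ave"):
    print(how, aggregated_similarity(parents, ic, "x", "y", how=how))
```

With `how="max"` the sketch reproduces the standard behavior of selecting the common subsumer that maximizes the measure; the other two settings correspond to the alternative aggregations.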

4. More Recent Proposals for "Novel" Ontological Similarity Measures

The previous section presented the standard approaches to ontological similarity measures. This section describes numerous other proposed measures, many of which have been developed for use in biomedical domains. Biomedical engineering is a unique mix of engineering, medicine and science which emerged early last century. Breakthrough advances in biotechnology have given rise to rapid production of biomedical data [Spasić and Ananiadou 2005] and the creation of a wide variety of ontologies such as MeSH, SNOMED, the ICD family, the Gene Ontology and so on. For example, the Gene Ontology has been used in the assessment of similarity between genes and gene products based on the ontological similarity between the concepts, or GO terms, annotating the genes. The biomedical domain is serving as the primary impetus for the creation of new ontological similarity measures.

4.1 Similarity between Biomedical Concepts

Recently two new ontological similarity measures for biomedical concepts were proposed in [Nguyen and Al-Mubaid 2006] and [Al-Mubaid and Nguyen 2009]. The measure proposed in the second paper essentially uses the measure from the first paper but incorporates it into a measure between two concepts in two different ontologies. These ontologies have "bridge" concepts, i.e., concepts that occur in both ontologies. The first measure is based on the observation that the lower two nodes are in a hierarchy, the more similar they are. This observation is not new, since some existing path-based and IC ontological similarity measures adjust for the position of the concepts in the ontology based on their lowest (deepest) common ancestor for path-based measures or their most informative common ancestor for IC-based measures. Their proposed method, however, adjusts this depth by subtracting it from the overall depth D of the ontology:

$$D - \mathrm{Depth}(\mathrm{LCS}(c1, c2)) \tag{4.1}$$

Then their proposed similarity measure is defined as

$$\mathrm{sim}_{NM}(c1, c2) = \log_2\big((\mathrm{len}(c1, c2) - 1) \times (D - \mathrm{Depth}(\mathrm{LCS}(c1, c2))) + 2\big) \tag{4.2}$$

Looking at this measure, one sees that it uses the distance between the two concepts and then increases the distance based on the difference between the greatest depth and the depth of LCS(c1, c2). The greater the depth of LCS(c1, c2), the smaller the increase in the distance between c1 and c2. Therefore, if concepts c1 and c2 have the same distance between them as concepts c3 and c4 but have an LCS of greater depth than that of c3 and c4, the result is a smaller increase in the overall calculated distance fed into the logarithm function. One observation is that this method does not share the problem of the Leacock and Chodorow method, in which two pairs of concepts that have the same path distance but lie at different levels of the ontology still receive the same proportional ontological similarity, since the maximum depth is the same for the whole ontology. Their measure (actually a distance measure) does use the depth of the deepest common subsumer but again adjusts it by the overall depth of the whole ontology. This adjustment process is the same for each pair of concepts. Although their proposed distance method does improve on Leacock and Chodorow's measure in that it takes into account the depth of the least common ancestor of the two concepts, other ontological similarity measures such as the Wu-Palmer measure use the depth of the lowest (deepest) common ancestor without the adjustment of subtracting it from D.

Their experimentation reports results obtained by applying the proposed method and four other existing methods to biomedical datasets. The average correlation of the proposed method with the judgments of physicians and experts is higher than that of the other similarity methods (except that the Leacock and Chodorow method's correlation with physicians' judgments is slightly higher than that of the proposed one).
However, when one examines the resulting tables, their measure is at best only slightly better with respect to correlation with human judgments of similarity. They do not state what kind of correlation measure was used in this analysis. In the discussion of the results, the authors briefly mention that the Wu and Palmer method is similar to their measure in that it takes into account the depth of the deepest common ancestor of two concepts. Part of this thesis research is to show mathematically the relationships between the newer proposed ontological similarity measures and the standard ones.

In their more recent 2009 paper [Al-Mubaid and Nguyen 2009], they convert their measure into a SemDist measure. In this paper, the authors state that they want to combine both path length and depth of the nodes in their new measure. They incorrectly state, however, "In addition, the measure of Wu and Palmer [Wu and Palmer 1994] uses only depth of concept nodes" [Al-Mubaid and Nguyen 2009]. The measure they propose is the same as in the previous paper but with a few added parameters. It is a path-based measure that uses the depth of the lowest common subsumer, i.e., the one that is deepest in the ontology, and normalizes it by subtracting it from D, the depth of the overall ontology, to define the common specificity as in their first paper:

$$\mathrm{CSpec}(c1, c2) = D - \mathrm{depth}(\mathrm{LCS}(c1, c2)) \tag{4.3}$$

Here c3 = LCS(c1, c2) is selected not by the maximum information content but by the maximum depth. They then define a semantic distance between c1 and c2 as

$$\mathrm{SemDist}(c1, c2) = \log\big((\mathrm{path} - 1)^{\alpha} \times \mathrm{CSpec}(c1, c2)^{\beta} + k\big) \tag{4.4}$$

where path is the shortest path length between the two concept nodes. This SemDist is the same measure as in their previous paper except that they added the parameters α, β and k. These parameters are all set to 1 in the experiments described in the paper, so this SemDist measure is what they previously proposed as their semantic similarity measure, except that they previously added 2 instead of 1 (k = 1). This SemDist measure is to be used for concepts that occur in the same primary ontology.
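The relationship between measures (4.2) and (4.4) can be made concrete with a short Python sketch. The function names are hypothetical, the natural logarithm is assumed for (4.4) since the base is not stated here, and the path length is assumed to be counted so that identical concepts have path length 1.

```python
import math


def cspec(max_depth, lcs_depth):
    """Common specificity (4.3): D minus the depth of the deepest common subsumer."""
    return max_depth - lcs_depth


def sem_dist(path_len, max_depth, lcs_depth, alpha=1.0, beta=1.0, k=1.0, base=math.e):
    """SemDist (4.4); the logarithm base is an assumption of this sketch."""
    if path_len < 1:
        raise ValueError("path length is assumed to be at least 1")
    value = (path_len - 1) ** alpha * cspec(max_depth, lcs_depth) ** beta + k
    return math.log(value, base)


def sim_nm(path_len, max_depth, lcs_depth):
    """Earlier measure (4.2): same form with alpha = beta = 1, k = 2 and base 2."""
    return sem_dist(path_len, max_depth, lcs_depth, k=2.0, base=2)


# Hypothetical ontology of depth D = 10 and two pairs with the same path length.
shallow_lcs = sem_dist(path_len=3, max_depth=10, lcs_depth=8)
deep_lcs = sem_dist(path_len=3, max_depth=10, lcs_depth=9)
print(shallow_lcs, deep_lcs, sim_nm(3, 10, 8))
```

For the same path length, the pair with the deeper common subsumer receives the smaller distance, and identical concepts (path length 1) receive distance 0 when k = 1.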
The objective of this paper is also to define similarity measures for concepts that occur in multiple ontologies. Their definition of the primary ontology is the ontology with the greatest granularity. The definition of the ontology with the greatest granularity is not clear, but it appears to be the one with the greatest depth. The authors then propose a measure that can be used when concepts are in different ontologies but these ontologies have common "bridge" concepts. Given a primary ontology containing c1, a secondary ontology containing c2, and a set of bridge concepts $bridge_i$ that occur in both ontologies, the formulas all remain the same except that $bridge_i$ is used as follows:

$$\mathrm{CSpec}_i(c1, c2) = D - \mathrm{depth}(\mathrm{LCS}(c1, bridge_i)) \tag{4.5}$$

$$\mathrm{SemDist}_i(c1, c2) = \log\big((\mathrm{path}_i - 1)^{\alpha} \times (\mathrm{CSpec}_i)^{\beta} + k\big) \tag{4.6}$$

$$\mathrm{SemDist}(c1, c2) = \min_{q} \, \big[\mathrm{SemDist}_q(c1, c2)\big] \tag{4.7}$$

The path distance between c1 and c2 is calculated as the sum of c1's distance to the bridge and c2's distance to the bridge. The distance of c2 to the bridge is scaled by the pathrate, calculated as the ratio $(2 D_1 - 1)/(2 D_2 - 1)$, where $D_1$ is the overall depth of the primary ontology and $D_2$ is the overall depth of the secondary ontology. The bridge concept in the primary ontology also serves to determine its lowest common subsumer with concept c1 in the primary ontology.

Numerous other rules are proposed for finding ontological similarity between concepts when they are in secondary ontologies. One case is when the concepts are both in the same secondary ontology. This case uses the same formula for SemDist, but Path(c1, c2) in the secondary ontology is scaled by the pathrate and CSpec(c1, c2) in the secondary ontology is scaled by $(D_1 - 1)/(D_2 - 1)$, where $D_1$ is the overall depth of the primary ontology and $D_2$ is the overall depth of the secondary ontology. Their rationale is that the semantic distance between the concepts in the secondary ontology must be converted into the scale of the primary ontology. The other case occurs when the two concepts c1 and c2 are in different secondary ontologies and neither concept exists in the primary ontology. One of the secondary ontologies then temporarily acts as the primary ontology. Their discussion of this case is not clear.
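The bridge-concept case in (4.5) to (4.7) can be sketched as follows, working only from the description in this section. The function names, the tuple layout of `bridges`, and the numeric values are hypothetical; the sketch also assumes that D in (4.5) is the depth $D_1$ of the primary ontology, that the natural logarithm is used, and that each scaled path length is at least 1.

```python
import math


def path_rate(d1, d2):
    """Scaling factor (2*D1 - 1) / (2*D2 - 1) from secondary to primary ontology."""
    return (2 * d1 - 1) / (2 * d2 - 1)


def bridged_sem_dist(d1, d2, bridges, alpha=1.0, beta=1.0, k=1.0):
    """Minimum over bridges of SemDist_i, as in (4.5) to (4.7).

    Each bridge is a tuple (dist_c1, dist_c2, lcs_depth):
      dist_c1   -- distance from c1 to the bridge in the primary ontology
      dist_c2   -- distance from c2 to the bridge in the secondary ontology
      lcs_depth -- depth of LCS(c1, bridge) in the primary ontology
    """
    rate = path_rate(d1, d2)
    distances = []
    for dist_c1, dist_c2, lcs_depth in bridges:
        path = dist_c1 + rate * dist_c2   # c2's leg is rescaled by the pathrate
        cspec = d1 - lcs_depth            # assumption: D in (4.5) is D1
        distances.append(math.log((path - 1) ** alpha * cspec ** beta + k))
    return min(distances)


# Hypothetical depths and two hypothetical bridge concepts.
result = bridged_sem_dist(d1=10, d2=5, bridges=[(2, 1, 6), (3, 1, 8)])
print(result)
```

In this example the second bridge gives the smaller distance even though its path is longer, because its common subsumer with c1 lies deeper in the primary ontology.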

For calculating SemDist between concepts in multiple ontologies, they recommend that the ontology with the greatest granularity be selected as the primary ontology. If a concept occurs in multiple secondary ontologies, they recommend selecting the ontology that has the most overlap of concepts with the primary ontology. The authors also develop a set of experiments using two vocabularies from the UMLS, SNOMED-CT and MeSH, and the WordNet 2.0 ontology, together with several different datasets based on previous experiments that evaluate measures by their correlation with human judgments of similarity between concepts in the vocabulary. Their experiments use WordNet 2.0 as the primary ontology and MeSH and SNOMED-CT as the secondary ontologies. One aspect that is not clear is the results for two other measures. No explanation is given of how the results are calculated for the Leacock and Chodorow measure and the Wu and Palmer measure. These two measures are defined for a single ontology. It is not clear how they are adapted for multiple ontologies in order to produce the numbers provided in the tables.

4.2 Semantic Relatedness Measure Using Object Properties in an Ontology

In [Mazuel and Sabouret 2008] a semantic relatedness measure is proposed that makes use of the Hirst and St-Onge patterns for semantically correct paths [Hirst and St-Onge 1998] and the information-theoretic paradigm introduced in [Resnik 1995]. In all of the previous discussions of ontological similarity measures, the type of relationship used to link concepts to one another is the is-a (subsumption) relationship or the part-of relationship. In ontologies where other relationships exist between concepts, it might be the case that the ontological similarity is low but the concepts are still highly related.
Although most measures focus on the hierarchical structuring relationships, Hirst and St-Onge propose a semantic relatedness measure that requires certain patterns of changes in direction to hold in order to calculate the semantic relatedness between two concepts. In [Mazuel and Sabouret 2008] a relatedness measure is proposed that integrates the use of other kinds of links in determining path-based measures. Their discussion contains some errors. For example, they state: "The first node-based similarity measure, proposed by Resnik in [Resnik 1995], is defined by the information content of the closest common parent (ccp) of the two concept c1 and c2." This statement is incorrect. The closeness of the common parent has nothing to do with which common parent is selected. It is the most informative common ancestor, i.e., the one with the highest IC value, that should be selected.

The objective of the authors is to extend to non-hierarchical links the assumption that two different hierarchical edges do not carry the same information content. There are two situations: single-relation paths and mixed-relation paths. For a single-relation path, the Jiang and Conrath method is used if the path has only "is-a" (upward) and "includes" (downward) relations, although the authors state that the upward path distance has to be calculated separately from the downward path and the two added together. This approach is simply the same as

$$\mathrm{dist}_{JC}(c1, c2) = \mathrm{IC}(c1) + \mathrm{IC}(c2) - 2 \, \mathrm{IC}(c3) \tag{4.8}$$

They use the method of calculating IC given in [Seco et al. 2004], defined above as $\mathrm{IC}_{ont}(c)$. For paths that use relations which are not hierarchical, a static strength $TC_X$ is associated with each relation type X, and the path weight is calculated as:

$$W(\mathrm{path}_X(x, y)) = TC_X \tag{4.9}$$

For a mixed-relation path from concept c1 to concept c2, the path can be factorized into an ordered set of n single-relation sub-paths, whose weights are then added together. They define the minimal factorization $T_{min}(\mathrm{path}(c1, c2))$ as the factorization which minimizes the value n. The weight of the mixed path (c1, c2) is then defined as the sum of the weights of all sub-paths of $T_{min}$. The final distance between two concepts is defined by equation (4.10), where HSO(p) allows only paths that are semantically correct according to the rules of Hirst and St-Onge to be used. Since this is a distance measure, the authors convert it to a similarity measure by subtracting it from the greatest distance:

$$\mathrm{rel}(c1, c2) = 2 \, \mathrm{IC}_{max} - \mathrm{dist}(c1, c2) \tag{4.10}$$

Tests are run on the Miller and Charles data [Miller and Charles 1991] and the WordSimilarity-353 data set using the WordNet ontology (only the noun part, which is the standard approach). Their experiments show that their measure has a higher Pearson correlation with human similarity judgments than any of the Rada, Resnik, Lin, Jiang and Conrath, and Hirst and St-Onge measures.

4.3 Plethora of Similarity Measures in Bioinformatics

In [Pesquita et al 2009] an overview of the wide variety of semantic similarity measures is presented. In this section, some of these measures are presented in order to illustrate the proliferation of such measures and to argue for the development of a framework for comparing such measures mathematically and experimentally without using correlation with some gold standard of similarity assessment. In that paper, the primary ontology with which these similarity measures have been used is the Gene Ontology. The performance of the semantic similarity measures is assessed by how well they can be used to determine the similarity of genes or gene products that are annotated using GO terms. The similarity between two genes or gene products is determined as an aggregation of the similarities between their sets of GO term annotations. The performance of the ontological similarity measures is then
