MATHEMATICAL AND EXPERIMENTAL INVESTIGATION OF ONTOLOGICAL SIMILARITY MEASURES AND THEIR USE IN BIOMEDICAL DOMAINS
ABSTRACT
MATHEMATICAL AND EXPERIMENTAL INVESTIGATION OF ONTOLOGICAL SIMILARITY MEASURES AND THEIR USE IN BIOMEDICAL DOMAINS
by Xinran Yu
Similarity measurement is an important notion. In the context of ontologies, similarity measures are used to determine how similar one concept is to another. Because graph models have been used to represent ontologies, a variety of algorithms have been proposed for calculating the similarity between the graph nodes that represent ontological concepts. This thesis reviews existing ontological similarity measures and investigates a wide range of them mathematically and experimentally. The objective is not to assess performance against a gold standard of similarity judgment but to develop a better understanding of the relationships among these measures by comparing their results when applied to the Gene Ontology. The experimental results show that some ontological similarity measures, especially information content-based measures, are highly correlated. The results of experiments comparing corpus-based to ontology-based information content measures for the Gene Ontology support previous experimental results using WordNet, which demonstrated little difference between the two approaches.
MATHEMATICAL AND EXPERIMENTAL INVESTIGATION OF ONTOLOGICAL SIMILARITY MEASURES AND THEIR USE IN BIOMEDICAL DOMAINS
A Thesis Submitted to the Faculty of Miami University in partial fulfillment of the requirements for the degree of Master of Science
Department of Computer Science
by Xinran Yu
Miami University
Oxford, Ohio
2010
Advisor: Valerie Cross, PhD.
Reader: Alton Sanders, PhD.
Reader: Eric Bachmann, PhD.
CONTENT
1. Introduction
Brief Historical Overview of Semantic (Ontological) Similarity
Overview of Standard Ontological Similarity Measures
Path-based or Edge Counting Ontological Similarity Measures
Information-content Ontological Similarity Measure
Tversky feature-based Ontological Similarity Measures
TaxPac Implementation of Standard Ontological Similarity Measures
More Recent Proposals for "Novel" Ontological Similarity Measures
Similarity between Biomedical Concepts
Semantic Relatedness Measure Using Object Properties in an Ontology
Plethora of Similarity Measures in Bioinformatics
Experimental Investigations of Ontological Similarity Measures
GO Description and its Concept Attributes
Structure Analysis of Gene Ontology
Experimental Investigation on IC and IC-based Ontological Similarity Measures
IC Experiments Using the GO
Experimental Investigation on GRASM method
Cellular Component Sub-Ontology Analysis using Average IC
Molecular Function Sub-Ontology Analysis
Biological Process Sub-Ontology Analysis
Experimental Investigation on Path Based Measures
Cellular Component Sub-Ontology Analysis
Molecular Function Sub-Ontology Analysis
Biological Process Sub-Ontology Analysis
Experimental Investigation on Set Based Measures
Correlations among different similarity measures in the three categories
A Classification of Ontological Similarity Measures
Conclusions and Future Work
References
Appendix
LIST OF TABLES
Table 5.1 CC Concept Attributes
Table 5.2 MF Concept Attributes
Table 5.3 BP Concept Attributes
Table 5.4 How Many Nodes Have the Following Number of Children
Table 5.5 How Many Nodes Have the Following Numbers of Parents
Table 5.6 Percentage of Descendents for Top 5 Children Concepts of The Root Concept
Table 5.7 Means and Standard Deviations for IC Values for CC Terms
Table 5.8 Pearson Correlation between IC Measures for CC Terms
Table 5.9 Spearman Correlation between IC Measures for CC Terms
Table Kendall Tau Correlation between IC Measures for CC Terms
Table Means and Standard Deviations for IC Values for MF Terms
Table Pearson Correlation between IC Measures for MF Terms
Table Spearman Correlation between IC Measures for MF Terms
Table Kendall Tau Correlation between IC Measures for MF Terms
Table Means and Standard Deviations for IC Values for BP Terms
Table Pearson Correlation between IC Measures for BP Terms
Table Spearman Correlation between IC Measures for BP Terms
Table Kendall Tau Correlation between IC Measures for BP Terms
Table Comparison of IC's Mean Values for CC MF and BP
Table GO Max Depth Statistics
Table Pearson Correlation Corpus IC to Ontology IC with k parameter
Table Spearman Correlation Corpus IC to Ontology IC with k parameter
Table Kendall Tau Correlation Corpus IC to Ontology IC with k parameter
Table Means and Standard Deviations for IC-Based Similarity Measures for CC Terms
Table Pearson Correlations for Ontological IC-based Similarity measures on CC
Table Pearson Correlations for Corpus IC-based Similarity Measures on CC
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for CC
Table Spearman Correlations for Ontological IC-based Similarity measures on CC
Table Spearman Correlations for Corpus IC-based Similarity measures on CC
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for CC
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on CC
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on CC
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on CC
Table Means and Standard Deviations for IC ontological similarity measures for MF
Table Pearson Correlations for Ontological IC-based Similarity measures on MF
Table Pearson Correlations for Corpus IC-based Similarity measures on MF
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for MF
Table Spearman Correlations for Ontological IC-based Similarity measures on MF
Table Spearman Correlations for Corpus IC-based Similarity measures on MF
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for MF
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on MF
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on MF
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on MF
Table Means and Standard Deviations for IC-Based Similarity Measures for BP Terms
Table Pearson Correlations for Ontological IC-based Similarity measures on BP
Table Pearson Correlations for Corpus IC-based Similarity measures on BP
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures for BP
Table Spearman Correlations for Ontological IC-based Similarity measures on BP
Table Spearman Correlations for Corpus IC-based Similarity measures on BP
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures for BP
Table Kendall Tau Correlations for Ontological IC-based Similarity measures on BP
Table Kendall Tau Correlations for Corpus IC-based Similarity measures on BP
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures on BP
Table Pearson Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Spearman Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Kendall Tau Correlations for Ontological IC vs. Corpus IC Similarity Measures
Table Mean and Standard Deviation for GRASM Similarity Measures on CC Terms
Table Pearson Correlations for GRASM Similarity measures on CC
Table Spearman Correlations for GRASM Similarity measures on CC
Table Kendall Tau Correlations for GRASM Similarity measures on CC
Table Mean and Standard Deviation for GRASM Similarity Measures on MF Terms
Table Pearson Correlations for GRASM Similarity measures on MF
Table Spearman Correlations for GRASM Similarity measures on MF
Table Kendall Tau Correlations for GRASM Similarity measures on MF
Table Mean and Standard Deviation for GRASM Similarity Measures on BP Terms
Table Pearson Correlations for GRASM Similarity measures on BP
Table Spearman Correlations for GRASM Similarity measures on BP
Table Kendall Tau Correlations for GRASM Similarity measures on BP
Table Means for each similarity measure to compare original to average IC
Table Pearson Correlations for Each Similarity measure and their corresponding average measures
Table Spearman Correlations for Each Similarity measure and their corresponding average measures
Table Kendall Correlations for Each Similarity measure and their corresponding average measures
Table 6.1 Means and Standard Deviations for Path-Based Similarity Measures for CC Terms
Table 6.2 Pearson Correlations for Path-based Similarity measures on CC
Table 6.3 Spearman Correlations for Path-based Similarity measures on CC
Table 6.4 Kendall Tau Correlations for Path-based Similarity measures on CC
Table 6.5 Means and Standard Deviations for Path-Based Similarity Measures for MF Terms
Table 6.6 Pearson Correlations for Path-based Similarity measures on MF
Table 6.7 Spearman Correlations for Path-based Similarity measures on MF
Table 6.8 Kendall Tau Correlations for Path-based Similarity measures on MF
Table 6.9 Means and Standard Deviations for Path-Based Similarity Measures for BP Terms
Table Pearson Correlations for Path-based Similarity measures on BP
Table Spearman Correlations for Path-based Similarity measures on BP
Table Kendall Tau Correlations for Path-based Similarity measures on BP
Table 7.1 Means and Standard Deviations for Set-Based Similarity Measures for CC Terms
Table 7.2 Pearson Correlations for Set-based Similarity measures on CC
Table 7.3 Spearman Correlations for Set-based Similarity measures on CC
Table 7.4 Kendall Tau Correlations for Set-based Similarity measures on CC
Table 7.5 Means and Standard Deviations for Set-Based Similarity Measures for MF Terms
Table 7.6 Pearson Correlations for Set-based Similarity measures on MF
Table 7.7 Spearman Correlations for Set-based Similarity measures on MF
Table 7.8 Kendall Tau Correlations for Set-based Similarity measures on MF
Table 7.9 Means and Standard Deviations for Set-Based Similarity Measures for BP Terms
Table Pearson Correlations for Set-based Similarity measures on BP
Table Spearman Correlations for Set-based Similarity measures on BP
Table Kendall Tau Correlations for Set-based Similarity measures on BP
Table 8.1 Means and Standard Deviations for Different Categories Similarity Measures for CC Terms
Table 8.2 Pearson Correlations for Different Categories Similarity measures on CC
Table 8.3 Spearman Correlations for Different Categories Similarity measures on CC
Table 8.4 Kendall Tau Correlations for Different Categories Similarity measures on CC
Table 8.5 Means and Standard Deviations for Different Categories Similarity Measures for MF Terms
Table 8.6 Pearson Correlations for Different Categories Similarity measures on MF
Table 8.7 Spearman Correlations for Different Categories Similarity measures on MF
Table 8.8 Kendall Tau Correlations for Different Categories Similarity measures on MF
Table 8.9 Means and Standard Deviations for Different Categories Similarity Measures for BP Terms
Table Pearson Correlations for Different Categories Similarity measures on BP
Table Spearman Correlations for Different Categories Similarity measures on BP
Table Kendall Tau Correlations for Different Categories Similarity measures on BP
Table 9.1 Property Analysis of Some Ontological Similarity Measures
LIST OF FIGURES
Figure 1 Illustration for Ontological Similarity Measure Examples
Figure 2 Portion of the GO
Figure 3 Example for Minimum Height and Minimum Depth
Figure 4 Union and Intersection of Ancestors and Descendents
Figure 5 Mean Values for Each Measure in CC, MF and BP
Figure 6 Pearson Correlations for Each Pair of Measures in CC, MF and BP
Figure 7 Spearman Correlations for Each Pair of Measures in CC, MF and BP
Figure 8 Kendall Correlations for Each Pair of Measures in CC, MF and BP
Figure 9 Classification of Semantic Similarity Measures [Pesquita et al. 2009]. MICA is most informative common ancestor; DCA is disjoint common ancestor
Figure 10 Classification Scheme for Ontological Similarity Measures
ACKNOWLEDGEMENTS
I would like to give my sincere thanks to Prof. Valerie Cross for her valuable advice, constant encouragement and technical help on my thesis. Without her guidance, I could not have finished this work. I also want to thank my thesis committee members, Prof. Alton Sanders and Prof. Eric Bachmann, for their guidance and support. I am also grateful to Dr. Joslyn and his lab at PNNL. Without his funding and help, as well as the TaxPac software from PNNL, I could not have done so well on my research and experiments. Finally, I would like to thank my mom for her support and care during the summer.
1. Introduction
We live in an information society, and information plays an important role almost everywhere and at any time. More and more emphasis is being placed on the semantics of the information that is made readily available on the World Wide Web. Ontologies are being used to represent concepts and the relationships among them in a wide range of problem domains, particularly in bioinformatics and biomedical domains. The visual representation of such ontologies typically takes the form of a graph model in which the nodes represent the concepts and the links between the nodes represent the relationships. Graph theory is being used to analyze and understand the structure of ontologies.
Similarity measurement is an important notion that is used to compare two objects to determine how well they agree with or match each other. It is extremely important in many information processing contexts, such as search engines, collaborative filtering, and clustering. In the context of ontologies, an important use of similarity measurement is to determine how similar one concept is to another. Because graph models have been used to represent ontologies, a variety of algorithms have been proposed for calculating the similarity between or among nodes within a graph where the nodes represent the ontological concepts.
Historically, the term semantic similarity measure has been used to refer to measures that assess how similar one concept is to another. In the context of this thesis, an ontological similarity measure is a semantic similarity measure specific to assessing similarity between concepts within an ontology. Other semantic similarity measures have also been developed using dictionary-based [Kozima and Ito 1997] and thesaurus-based [Okumura and Honda 1994] approaches. A wide variety of ontological similarity measures have been proposed.
One of the earliest and simplest ones is a path-based measure [Rada 1989] that just counts the number of edges between two nodes representing the concepts to find the distance between these concepts. This distance can be converted to a similarity measure.
Others are more complex, such as Wu & Palmer's method [Wu and Palmer 1994], which scales path length by the depth of the concepts' common ancestor, and Jiang & Conrath's method [Jiang and Conrath 1997], which considers the information content of the concepts. Information content has been measured in different ways, such as from the concept's position or role in the graph [Seco et al. 2004] or from a usage analysis of the concept within a reference corpus [Resnik 1995]. For the most part, these algorithms have been developed relying on the graphical representation of an ontology.
As new ontological similarity measures have been proposed, evaluations of them against existing ones have typically used one of three approaches: mathematical analysis, domain-specific applications, and comparison to human judgments of similarity [Budanitsky 1999]. The primary approach, however, has been comparison to human judgments of similarity. More recently, due to the use of ontological similarity in bioinformatics, the comparison has shifted to other measures of similarity for genes or gene products, such as sequence similarity [Lord et al. 2003a]. Few efforts have been made on mathematical analysis of ontological similarity measures [Wei 1993] [Lin 1998] [Cross 2006]. Mathematical analysis explores a measure's mathematical properties to determine whether it is a true metric and how it relates mathematically to other existing measures.
This thesis research reviews existing ontological similarity methods found in the research literature, particularly in the biomedical and bioinformatics areas, where these measures have recently proliferated [Pesquita et al. 2009].
In particular, this research makes the following contributions:
1) Analyze existing ontological similarity measures to discover their relationships and, if possible, an ordering for the results of these measures.
2) Explore the role of aggregation of ontological similarity when it is used in measuring similarity between objects described using ontological concepts.
3) Investigate in detail the overlap between recent "novel" measures and the older existing measures.
4) Perform a controlled experimental evaluation of these measures using generated ontologies with different characteristics to explore how ontological structure affects a measure's performance.
5) Develop a formal framework for categorization of ontological similarity measures and compare this framework to the limited existing ontological similarity categorization frameworks.
The remainder of this thesis begins with Section 2, which provides a brief introduction to ontological similarity and its historical development, noting some of the standard ontological measures. Section 3 first provides the background for the standard historical ontological similarity measures. It then describes the work done to incorporate these into the TaxPac software being developed at Pacific Northwest National Laboratory [Joslyn et al. 2009], in order to illustrate the proliferation of these measures. More recent "novel" ontological similarity measures are discussed in Section 4. Sections 5, 6, 7 and 8 describe experiments designed to examine similarity measures using GO's three sub-ontologies. Section 9 presents the classification scheme, and Section 10 gives conclusions and future work.
2. Brief Historical Overview of Semantic (Ontological) Similarity
One of the earliest proposals for similarity between concepts in a semantic network simply determines the distance between the two concepts by counting the number of edges on the shortest path between them [Rada 1989]. This measure is considered a semantic distance. The use of the term "semantic" is a result of this distance being determined in a semantic network. The shorter this distance, the more semantically similar the two concepts are; that is, the semantic distance can be converted into a semantic similarity measure, since similarity is considered an inverse function of the distance.
This approach is criticized because it does not consider where the concepts occur in the semantic network. The criticism stems from the measure's shortcoming in capturing similarities in a network that is organized hierarchically. In this case, two concepts that are deeper in the hierarchy should be considered more similar than two concepts higher in the hierarchy, even though the same number of edges separates the two pairs of concepts. For example, if the network is a tree structure, two nodes directly under the root should have less similarity than two leaves at a depth of 10, because the upper nodes are more general and the lower ones are more specific. Considering only the distance between two concepts and ignoring their position within the hierarchical structure is therefore not enough to determine their similarity.
Because of the weakness of Rada's simple edge counting measure, numerous researchers proposed methods to weight the edges in the network so that edges lower in the hierarchy were weighted less than edges higher up in the hierarchy [Lee et al. 1993] [Leacock and Chodorow 1998].
Another measure using edge count for distance scales the distance by the distance of the lowest common ancestor of the two concepts from the root of the hierarchy [Wu and Palmer 1994].
Besides edge counting methods, other researchers focus on determining similarity between two concepts based on the shared information content of the concepts. The first proposal using this approach utilizes the information content of the lowest common ancestor of the two concepts [Resnik 1995] to determine the amount of information shared between the two concepts. In this ontological similarity measure, information content for a concept is determined using the frequency of occurrence of the concept within a corpus to calculate a probability p(c) for that concept. Information content of a concept is then quantified as -log p(c). The similarity between two concepts is then given as a function of the information content of the most specific parent of both concepts. A criticism of this approach is that it only looks at the shared information content and does not consider the information content of the two concepts themselves. Other researchers created a measure that incorporates the individual information content of each concept into the ontological similarity [Lin 1998]. Another approach to determining information content has also been proposed that avoids the need for an external corpus: the information content of a concept is specified as a function of the number of descendents of the concept [Seco et al. 2004].
Another approach to evaluating ontological similarity [Rodriguez and Egenhofer 2003] relies heavily on a psychological model of similarity, Tversky's parameterized ratio model of similarity [Tversky 1977]. In this model, the similarity between two objects is based on a ratio of the features they share to the total of those shared and those not shared. In ontological similarity, the objects are the concepts, and a variety of characteristics of the concepts have been proposed as the features for measuring ontological similarity.
This section has provided an overview of standard ontological similarity measures in the context of their historical development but has not presented the details of their mathematical formalization. The next section gives examples of various ontological similarity measures along with their detailed formulas.
The general categories are presented with at least one example in each category.
3. Overview of Standard Ontological Similarity Measures
As seen in the previous section, a variety of measures have been proposed to measure the similarity between two concepts in an ontology. Initially, the two major categories of such measures were path-based (or distance-based) measures and information content measures. Later, interest returned in a feature-based or set-based approach to measuring concept similarity. The basis for this approach is Tversky's parameterized ratio model of similarity [Tversky 1977]. The following overview discusses some of the standard examples in each category and summarizes some of the material in [Cross 2009]. Then implementations of some of these ontological measures in the TaxPac software are described. Discussion of the implementation of a Tversky generalization is included in section 6, which presents components of this thesis research. In the following sections, Figure 1 is used to clarify the discussion of standard ontological similarity measures.
Figure 1 Illustration for Ontological Similarity Measure Examples
Assume an ontology contains the two concepts c1 and c2, for which distance and corresponding similarity are to be assessed. The concept c3 in the discussion represents a common subsumer or ancestor of c1 and c2 in the ontology that is typically selected to maximize the similarity between the two concepts.
3.1 Path-based or Edge Counting Ontological Similarity Measures
Path-based measures are also referred to as edge-counting or distance-based measures, since they are determined primarily using a distance or a count of the number of edges that separate one concept in the ontology from another. One problem immediately noticed with such measures is that edges should not all represent the same uniform distance; that is, edges higher in an ontology represent greater distances than edges lower in the ontology. Also, these calculations represent distances, some of which are converted into similarity using various approaches. In this section we present the progression of standard distance-based ontological similarity measures.
The simplest measure is Rada's edge counting distance measure [Rada 1989]
$$\mathrm{dist}_{R}(c_1, c_2) = \min_{p}\,\big[\mathrm{len}(p(c_1, c_2))\big] \tag{3.1}$$
which produces the minimum length of all paths p between c1 and c2. Rada's measure was used in semantic networks and was not necessarily restricted to edges or links that represent is-a or hierarchical subsumption links. The obvious criticism was that edges were all weighted the same regardless of where they occurred in the semantic network, either high or low.
An approach to converting distance measures into a similarity measure is proposed by Leacock and Chodorow (1998) as
$$\mathrm{sim}_{LC}(c_1, c_2) = \max_{p}\,\left[-\log\frac{N_p}{2D}\right] \tag{3.2}$$
where $N_p$ is the number of nodes in path p from c1 to c2 and D is the maximum depth of the taxonomy.
Another approach to using path-based distances in the calculation of ontological similarity uses the distance of the common subsumer c3 from the root of the ontology [Wu and Palmer 1994] to express the intuition that the lower the concepts c1 and c2 are in the ontology, the greater the similarity of the concepts:
$$\mathrm{sim}_{WP}(c_1, c_2) = \frac{2\,\mathrm{len}(c_3, \mathrm{root})}{\mathrm{len}(c_1, c_3) + \mathrm{len}(c_2, c_3) + 2\,\mathrm{len}(c_3, \mathrm{root})} \tag{3.3}$$
In this measure, c3 is typically selected as the lowest or deepest common subsumer in the ontology in order to maximize the numerator.
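The three path-based measures in equations (3.1) to (3.3) can be sketched on a small tree. This is only an illustration under stated assumptions: the toy taxonomy `PARENT`, the helper names, and the restriction to a single-parent tree are hypothetical and not taken from the thesis or from TaxPac, and the Leacock-Chodorow and Wu-Palmer formulas are used in their commonly cited forms.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from the thesis.
PARENT = {
    "animal": "root",
    "plant": "root",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}


def path_to_root(c):
    """Return the chain [c, parent(c), ..., root]."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(c):
    """Number of edges from c up to the root."""
    return len(path_to_root(c)) - 1


def lowest_common_subsumer(c1, c2):
    """Deepest concept that subsumes both c1 and c2 (c3 in the text)."""
    up1 = set(path_to_root(c1))
    for node in path_to_root(c2):  # walks upward, so the first hit is deepest
        if node in up1:
            return node
    raise ValueError("concepts do not share a root")


def dist_rada(c1, c2):
    """Eq. (3.1): in a tree the only simple path runs through c3."""
    c3 = lowest_common_subsumer(c1, c2)
    return (depth(c1) - depth(c3)) + (depth(c2) - depth(c3))


def sim_lc(c1, c2, max_depth):
    """Eq. (3.2): -log(N_p / (2 D)), with N_p counted in nodes (edges + 1)."""
    n_p = dist_rada(c1, c2) + 1
    return -math.log(n_p / (2.0 * max_depth))


def sim_wp(c1, c2):
    """Eq. (3.3): depth of c3 relative to the path lengths through c3."""
    c3 = lowest_common_subsumer(c1, c2)
    n3 = depth(c3)
    n1 = depth(c1) - n3
    n2 = depth(c2) - n3
    denom = n1 + n2 + 2 * n3
    return 2 * n3 / denom if denom else 1.0  # both concepts are the root
```

In this toy tree, `dist_rada("dog", "cat")` and `dist_rada("animal", "plant")` are both 2, which is the uniform-edge criticism in miniature, while `sim_wp` separates the two pairs because the common subsumer of the first pair lies deeper in the hierarchy.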
These ontological similarity measures are the early standard ones for path-based measures. Other more recent variations of these path-based measures are discussed in a later section.
3.2 Information-content Ontological Similarity Measure
Path-based measures of ontological similarity are criticized for not having an appropriate edge weighting mechanism to reflect that the same path distance means something different between concepts at a higher level than between concepts at a lower level. Instead, information content-based (IC) ontological measures rely on a measure of how informative concepts are within an ontology when determining how similar two concepts are. The earliest approach to determining information content is based on using an external resource such as an associated corpus for the problem domain [Resnik 1995]. Using standard information theory [Ross 1976], the information content of a concept c is given as
$$IC_{corpus}(c) = -\log p(c) \tag{3.4}$$
with p(c) being the probability of the occurrence of an instance of concept c in a specified corpus. The value p(c) is based on the frequency of the concept. The frequency of a concept is the number of occurrences in the corpus of all words representing that concept. The frequency of the concept also includes the total frequencies of all its children concepts. The probability is calculated by dividing this total frequency count by the total number of words in the corpus. Because the formula is the negative logarithm of the probability, the information content decreases as the probability increases; therefore, concepts higher in the ontology, which have a greater probability of occurring, have less information content than those lower in the ontology.
Others have argued that instead of using an external resource for determining the information content of a concept, the ontology structure itself and a concept's position within that structure should be used (Seco, Veale, and Hayes 2004). The intuition is that leaf concepts in the ontology are the most specific and therefore contain the most information, while root concepts are the least specific and therefore contain the least information. They developed the following information content measure:
$$IC_{ont}(c) = \frac{\log\left(\dfrac{\mathrm{num\_desc}(c) + 1}{\mathrm{max}_{ont}}\right)}{\log\left(\dfrac{1}{\mathrm{max}_{ont}}\right)} = 1 - \frac{\log(\mathrm{num\_desc}(c) + 1)}{\log(\mathrm{max}_{ont})} \tag{3.5}$$
where num_desc(c) is the number of descendents of concept c and max_ont is the maximum number of concepts in the ontology. This IC measure is normalized in that all leaf concepts have the maximum information content of 1. The information content decreases until the value is 0 for the root concept of the ontology.
A new IC method proposed by Wang et al. in 2009 is called a weighted information content measure. It incorporates not only the number of descendents a concept has but also the depth of the concept within the ontology:
$$IC_{ont\text{-}desc\text{-}depth}(c) = k\left(1 - \frac{\log(\mathrm{num\_desc}(c) + 1)}{\log(\mathrm{max}_{ont})}\right) + (1 - k)\,\frac{\log(\mathrm{depth}(c))}{\log(\mathrm{max\_depth})} \tag{3.6}$$
where depth(c) is the depth of concept c in the ontology and max_depth is the maximum depth of the ontology.
The first IC-based ontological similarity measure was proposed by Resnik (1995) as
$$\mathrm{sim}_{RES}(c_1, c_2) = \max_{c \,\in\, S(c_1, c_2)}\,\big[IC_{corpus}(c)\big] \tag{3.7}$$
where S(c1,c2) is the set of concepts that subsume both c1 and c2. From Figure 1, assume that c3 is the concept that produces the maximum IC value. Basically, this measure examines all concepts that subsume both c1 and c2 and that have no descendents that also subsume both c1 and c2, and it uses the one with the most information content, i.e., the most informative one. A major criticism of Resnik's measure is that it only looks at the information shared between the two concepts and does not incorporate the separate information content of the two concepts themselves. Lin (1998) defined another ontological measure to address this criticism:
$$\mathrm{sim}_{Lin}(c_1, c_2) = \frac{2\,IC(c_3)}{IC(c_1) + IC(c_2)} \tag{3.8}$$
where c3 is the subsuming concept with the most information content. Note that IC is not subscripted, since either an external resource such as a corpus or the ontological structure could be used to determine IC.
Jiang and Conrath (1997) define another distance measure between ontological concepts. Their objective is to integrate path-based measures and information content methods. Intuitively, the distance is based on totaling the separate information contents of the two concepts and subtracting twice the information content of their most informative subsumer:
$$\mathrm{dist}_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,IC(c_3) \tag{3.9}$$
Whatever information content remains indicates the distance between them. If there is no IC left, i.e., 0, then the two concepts are the same. This distance measure can be converted to similarity, and several approaches have been proposed. For example, Seco, Veale, and Hayes (2004) used the following:
$$\mathrm{sim}_{JC}(c_1, c_2) = 1 - \big(IC(c_1) + IC(c_2) - 2\,IC(c_3)\big) \times 0.5 \tag{3.10}$$
The relationship between the Lin ontological similarity measure and the Jiang and Conrath ontological distance measure can be seen if Lin's measure is converted into a distance by subtracting it from 1, since it is normalized to the [0, 1] range [Cross 2009]:
$$\mathrm{dist}_{Lin}(c_1, c_2) = 1 - \mathrm{sim}_{Lin}(c_1, c_2) = 1 - \frac{2\,IC(c_3)}{IC(c_1) + IC(c_2)} = \frac{IC(c_1) + IC(c_2) - 2\,IC(c_3)}{IC(c_1) + IC(c_2)} \tag{3.11}$$
The dist_JC ontological distance measure is simply an unnormalized version of dist_Lin.
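The information content measures of this section can be sketched in a few lines. The sketch below is a hedged illustration, not the thesis implementation: the toy taxonomy `PARENT` and all function names are hypothetical, and the ontology-based IC of Seco et al. is substituted for the corpus-based IC in Resnik's measure because no corpus is available here (the text notes that either source of IC may be used).

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from the thesis.
PARENT = {
    "animal": "root",
    "plant": "root",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def subsumers(c):
    """The concept itself plus every ancestor up to the root."""
    out = {c}
    while c in PARENT:
        c = PARENT[c]
        out.add(c)
    return out


def num_desc(c):
    """Number of proper descendents of c."""
    return sum(1 for d in CONCEPTS if d != c and c in subsumers(d))


def ic_ont(c):
    """Ontology-based IC: 1 - log(num_desc(c) + 1) / log(max_ont)."""
    return 1.0 - math.log(num_desc(c) + 1) / math.log(len(CONCEPTS))


def sim_res(c1, c2):
    """Resnik: IC of the most informative common subsumer (IC of c3)."""
    return max(ic_ont(c) for c in subsumers(c1) & subsumers(c2))


def sim_lin(c1, c2):
    """Lin: 2 IC(c3) / (IC(c1) + IC(c2))."""
    total = ic_ont(c1) + ic_ont(c2)
    return 2.0 * sim_res(c1, c2) / total if total else 1.0  # root vs. root


def dist_jc(c1, c2):
    """Jiang and Conrath: IC(c1) + IC(c2) - 2 IC(c3)."""
    return ic_ont(c1) + ic_ont(c2) - 2.0 * sim_res(c1, c2)


def sim_jc(c1, c2):
    """Conversion used by Seco et al.: 1 - 0.5 * dist_JC."""
    return 1.0 - 0.5 * dist_jc(c1, c2)
```

With these definitions the relationship between the Lin and Jiang-Conrath measures can be checked numerically: for any pair with nonzero total IC, `dist_jc(c1, c2)` equals `(1 - sim_lin(c1, c2)) * (ic_ont(c1) + ic_ont(c2))`.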
3.3 Tversky feature-based Ontological Similarity Measures
How humans judge similarity is an active research area of psychology. One of the most famous models for similarity assessment is Tversky's parameterized ratio model of similarity [Tversky 1977]:
$$S_{Tversky}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X - Y| + \beta\,|Y - X|} \tag{3.12}$$
With $\alpha = \beta = 1$, $S_{Tversky}$ becomes the Jaccard index:
$$S_{jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
With $\alpha = \beta = 1/2$, $S_{Tversky}$ becomes Dice's coefficient of similarity:
$$S_{dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{3.13}$$
With $\alpha = 1$, $\beta = 0$, $S_{Tversky}$ becomes the degree of inclusion for X, that is, the proportion of X overlapping with Y:
$$S_{inclusion}(X, Y) = \frac{|X \cap Y|}{|X|} \tag{3.14}$$
Similarly, with $\alpha = 0$, $\beta = 1$, $S_{Tversky}$ becomes the degree of inclusion for Y, the proportion of Y overlapping with X.
Using this model, researchers have begun looking at a concept in an ontology as an object with a set of features. There is a wide variety of "features" that may be selected to describe a concept within an ontology. For example, the set of features of a concept could be its set of ancestors. Then a natural ontological similarity measure between two concepts x and y with respective sets of ancestors X and Y would be the application of Tversky's parameterized ratio model of similarity. Another set of features to describe a concept is its set of descendents. Tversky's model is especially flexible in that any set of features describing a concept can be used in determining its similarity to another concept.
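The special cases of the ratio model can be checked with a short sketch. It assumes the usual set-cardinality form of Tversky's model with parameters alpha and beta; the function name `tversky` and the two ancestor sets are hypothetical examples, not data from the thesis.

```python
def tversky(x, y, alpha, beta):
    """Ratio model: |X & Y| / (|X & Y| + alpha |X - Y| + beta |Y - X|)."""
    x, y = set(x), set(y)
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    # Assumption: two empty feature sets are treated as identical.
    return common / denom if denom else 1.0


# Hypothetical feature sets: the ancestors of two concepts in a toy ontology.
X = {"mammal", "animal", "root"}
Y = {"animal", "root"}

jaccard = tversky(X, Y, 1.0, 1.0)      # |X & Y| / |X | Y|
dice = tversky(X, Y, 0.5, 0.5)         # 2 |X & Y| / (|X| + |Y|)
inclusion_x = tversky(X, Y, 1.0, 0.0)  # |X & Y| / |X|
inclusion_y = tversky(X, Y, 0.0, 1.0)  # |X & Y| / |Y|
```

Because Y is a subset of X in this example, the degree of inclusion for Y is 1 while the degree of inclusion for X is lower, which shows that the model is asymmetric whenever alpha and beta differ.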
Researchers [Rodriquez and Egenhofer 2003] applied Tversky's model repeatedly to define a similarity measure between two classes in an ontology. Their ontological similarity measure between entity classes c1 and c2 incorporates a weighted aggregation of Tversky's similarity measures on a wide range of feature sets of the concepts, including synonym sets, semantic neighborhoods, and distinguishing features. Distinguishing features are further classified into parts, functions and attributes. The only difference from Tversky's model is that the α and β parameters are determined as a function of the depth of the two concepts.

3.4 TaxPac Implementation of Standard Ontological Similarity Measures

TaxPac stands for Taxonomy Package, an experimental mathematics environment for knowledge systems analysis being developed at PNNL [Joslyn and White 2009]. It is a Python platform built as an extension of the NetworkX system for graph analysis developed by the Los Alamos National Laboratory. Its main goal is to use mathematical order theory to express and analyze knowledge bases, which can be represented in various graph structures ranging from digraphs to concept lattices. As part of the PNNL contract work this past summer, this package was extended with the basic standard ontological similarity measures described in the previous sections. These measures are included in the class BoundedDAG, which stands for bounded directed acyclic graph. The implementation of these ontological measures took advantage of the existing TaxPac data structures and classes suitable for representing ontologies and the concepts within them. Using and further extending this TaxPac environment is planned in order to accomplish the goals of this thesis research, which will require experiments and analysis on a wide range of ontological similarity measures.

In the TaxPac environment, the implementation of all the standard path-based measures uses only the distance between two nodes c1 and c2 through a common subsumer c3.
However, because there can be multiple subsumers (multiple parent nodes may exist for each of the nodes c1 and c2), the measures have been parameterized (min, max, ave) to allow selection of the minimum, the maximum, or the average over all the similarity values calculated using each of the common subsumers of the two nodes c1 and c2. The standard ontological similarity measures assume that the common subsumer which maximizes the similarity measure should be the result. The following section discusses newer measures that consider different aggregation approaches; those approaches provided the motivation for this parameterization.

For the information-content-based measures, a major aspect is the method used to calculate the IC value, i.e., using an outside corpus to probabilistically determine the IC value and assign it as a node weight, or using some other node metric or node weighting scheme based on the structure of the ontology graph, such as the number of descendants of a node. TaxPac provides both edge-weighting and node-weighting capabilities. In the current implementation, only the method proposed in [Seco et al 2004] has been tested within the TaxPac environment. Other node weighting schemes that have been coded but not tested are presented in Section 6.

From the presentation on information-content-based measures, one can see that all of them essentially draw from the same components: IC(c1), IC(c2) and IC(c3). A standard parameterized method was created that allows the creation of any of the IC measures based on the components selected in the standard IC formula. As with the path-based measures, the key to the standard IC measures is the common subsumer c3.
Because there can be multiple common subsumers, the standard IC ontological similarity measures are also parameterized to allow selection of the minimum, the maximum, or the average of all the similarity values calculated using each of the common subsumers of the two nodes c1 and c2.
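The (min, max, ave) parameterization over common subsumers can be sketched in plain Python as follows. This is not the TaxPac or BoundedDAG API: the child-to-parents dictionary, the helper names, the toy DAG, and the IC values are all assumptions made for illustration, and a node is treated here as one of its own subsumers.

```python
from statistics import mean


def ancestors(parents, node):
    """Return node plus all of its ancestors in a DAG given as child -> parents."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def aggregated_similarity(parents, ic, c1, c2, measure, mode="max"):
    """Evaluate an IC measure for every common subsumer c3, then aggregate."""
    common = ancestors(parents, c1) & ancestors(parents, c2)
    scores = [measure(ic[c1], ic[c2], ic[c3]) for c3 in common]
    aggregate = {"min": min, "max": max, "ave": mean}[mode]
    return aggregate(scores)


def lin(ic1, ic2, ic3):
    """Lin similarity built from the three shared components."""
    return 2.0 * ic3 / (ic1 + ic2)


# Toy DAG: x and y each have the two parents a and b, which share the root.
parents = {"x": ["a", "b"], "y": ["a", "b"], "a": ["root"], "b": ["root"]}
ic = {"root": 0.0, "a": 0.3, "b": 0.5, "x": 0.9, "y": 0.7}

for mode in ("min", "max", "ave"):
    print(mode, aggregated_similarity(parents, ic, "x", "y", lin, mode))
```

Passing the measure as a function mirrors the idea that every IC measure draws on the same three components, so only the combining formula and the aggregation mode change.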
As with the path-based measures, the newer aggregation approaches discussed in the following section provide the motivation for this parameterization.
4. More Recent Proposals for "Novel" Ontological Similarity Measures

The previous section presented the standard approaches to ontological similarity measures. This section describes numerous other proposed measures, many of which have been developed for use in biomedical domains. Biomedical engineering is a unique mix of engineering, medicine and science which emerged early last century. Breakthrough advances in biotechnology have given rise to rapid production of biomedical data [Spasić and Ananiadou 2005] and the creation of a wide variety of ontologies such as MeSH, SNOMED, the ICD family, the Gene Ontology and so on. For example, the Gene Ontology has been used in the assessment of similarity between genes and gene products based on the ontological similarity between the concepts, or GO terms, annotating the genes. The biomedical domain is serving as the primary impetus for the creation of new ontological similarity measures.

4.1 Similarity between Biomedical Concepts

Recently two new ontological similarity measures for biomedical concepts were proposed in [Nguyen and Al-Mubaid 2006] and [Al-Mubaid and Nguyen 2009]. The measure proposed in the second paper basically uses the measure from the first paper but incorporates it into a measure between two concepts in two different ontologies. These ontologies have "bridge" concepts, i.e., concepts that occur in both ontologies.

The first measure is based on the observation that the lower two nodes are in a hierarchy, the more similar they are. This observation is not new, since some existing path-based and IC ontological similarity measures make adjustments for the position of the concepts in the ontology based on their lowest (deepest) or most informative common ancestor, for path-based and IC-based measures respectively. Their proposed method, however, adjusts this depth by subtracting it from the overall depth D:

$$D - Depth(LCS(c_1, c_2)) \tag{4.1}$$
Then their proposed similarity measure is defined as

$$sim_{NM}(c_1, c_2) = \log_2\bigl((len(c_1, c_2) - 1) \times (D - Depth(LCS(c_1, c_2))) + 2\bigr) \tag{4.2}$$

Looking at this measure, one sees that it uses the distance between the two concepts and then increases that distance based on the difference between the greatest depth and the depth of LCS(c1, c2). The greater the depth of LCS(c1, c2), the smaller the increase in the distance between c1 and c2. Therefore, if concepts c1 and c2 are the same distance apart as concepts c3 and c4 but have an LCS of greater depth, the result is a smaller increase in the overall calculated distance being fed into the logarithm function.

An observation we make is that this method does not have the problem of the Leacock & Chodorow method. That problem is that when two pairs of concepts have the same path distance but lie at different levels of the ontology, they still have the same proportional ontological similarity, since the maximum depth is the same for the whole ontology. Their measure above (actually a distance measure) does use the depth of the deepest common subsumer but again adjusts it by the overall depth of the whole ontology. This adjustment process is the same for each pair of concepts. Although their proposed distance method does improve on Leacock and Chodorow's measure in that it takes into account the depth of the least common ancestor of the two concepts, other ontological similarity measures such as the Wu-Palmer measure use the depth of the lowest (deepest) common ancestor without the adjustment of subtracting it from D.

Their experiments report results obtained by applying the proposed method and four other existing methods to biomedical datasets. The correlation of the proposed method, averaged over physicians and experts, is higher than that of the other similarity methods (except that the Leacock & Chodorow method's correlation to physicians' judgment is slightly higher than that of the proposed method).
However, when one examines the resulting tables, their measure is at best better with respect to correlation with human judgments of similarity. They do not state what kind of
correlation measure was used in this analysis. In the discussion of the results, the authors briefly mention that the Wu and Palmer method is similar to their measure in that it takes into account the depth of the deepest common ancestor of two concepts. Part of this thesis research is to show mathematically the relationships between the newer proposed ontological similarity measures and the standard ones.

In their more recent paper [Al-Mubaid and Nguyen 2009], they correct their measure into a SemDist measure. In this paper, the authors state that they want to combine both path length and depth of the nodes in their new measure. They incorrectly state, however, "In addition, the measure of Wu and Palmer [Wu and Palmer 1994] uses only depth of concept nodes" [Al-Mubaid and Nguyen 2009]. The measure they propose is the same as in the previous paper but with a few added parameters. It is a path-based measure that uses the depth of the lowest common subsumer, i.e., the one that is deepest in the ontology, and normalizes it by subtracting it from D, the depth of the overall ontology, to define the common specificity as in their first paper:

$$CSpec(c_1, c_2) = D - depth(LCS(c_1, c_2)) \tag{4.3}$$

Here c3 = LCS(c1, c2) is selected not by the maximum information content but by the maximum depth. They then define a semantic distance between c1 and c2 as

$$SemDist(c_1, c_2) = \log\bigl((path - 1)^{\alpha} \times (CSpec)^{\beta} + k\bigr) \tag{4.4}$$

where path is the shortest path length between the two concept nodes. This SemDist is the same measure as in their previous paper except for the added parameters α, β and k. These parameters are all set to 1 in the experiments described in the paper, so this SemDist measure is what they previously proposed as their semantic similarity measure, except that they previously added 2 instead of k = 1. This SemDist measure is to be used for concepts that occur in the same primary ontology.
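Equations (4.2) to (4.4) can be compared directly in a few lines of Python. This is a sketch under stated assumptions: the path length is taken to be at least 1 (so that identical concepts give path = 1), the logarithm in (4.4) is taken as the natural logarithm because the base is not stated here, and the function and argument names are illustrative.

```python
import math


def cspec(max_depth, lcs_depth):
    """Common specificity (4.3): overall depth D minus the depth of the LCS."""
    return max_depth - lcs_depth


def sem_dist(path, max_depth, lcs_depth, alpha=1.0, beta=1.0, k=1.0):
    """SemDist (4.4); natural logarithm assumed, path assumed >= 1."""
    return math.log((path - 1) ** alpha * cspec(max_depth, lcs_depth) ** beta + k)


def sim_nm(path, max_depth, lcs_depth):
    """Earlier measure (4.2): base-2 logarithm with the constant 2."""
    return math.log2((path - 1) * cspec(max_depth, lcs_depth) + 2)


# Same path length, same ontology depth D = 10, but LCS at depth 8 versus depth 2.
deep = sem_dist(path=4, max_depth=10, lcs_depth=8)
shallow = sem_dist(path=4, max_depth=10, lcs_depth=2)
print(deep, shallow)
```

With α = β = k = 1 the argument of the logarithm in (4.4) differs from the one in (4.2) only by the added constant, which is the point made above about the two papers.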
The objective of this paper is also to define similarity measures for concepts that occur in multiple ontologies. Their definition of the primary ontology is the ontology with the greatest
granularity. The definition of the ontology with the greatest granularity is not clear, but it appears to be the one with the greatest depth. The authors then propose a measure that can be used when concepts are in different ontologies but these ontologies have common "bridge" concepts. Given a primary ontology containing c1, a secondary ontology containing c2, and a set of bridge concepts bridge_i that occur in both ontologies, the formulas all remain the same except that bridge_i is used as follows:

$$CSpec_i(c_1, c_2) = D - depth(LCS(c_1, bridge_i)) \tag{4.5}$$

$$SemDist_i(c_1, c_2) = \log\bigl((path_i - 1)^{\alpha} \times (CSpec_i)^{\beta} + k\bigr) \tag{4.6}$$

$$SemDist(c_1, c_2) = \min_q \bigl[SemDist_q(c_1, c_2)\bigr] \tag{4.7}$$

The path distance between c1 and c2 is calculated as the sum of c1's distance to the bridge and c2's distance to the bridge. The distance of c2 to the bridge is scaled by the pathrate, calculated as the ratio (2 D1 - 1)/(2 D2 - 1), where D1 is the overall depth of the primary ontology and D2 is the overall depth of the secondary ontology. The bridge concept in the primary ontology also serves to determine its lowest common subsumer with concept c1 in the primary ontology.

Numerous other rules are proposed for finding ontological similarity between concepts when they are in secondary ontologies. One case is when the concepts are both in the same secondary ontology. This case uses the same formula for SemDist, but Path(c1, c2) in the secondary ontology is scaled by the pathrate and CSpec(c1, c2) in the secondary ontology is scaled by (D1 - 1)/(D2 - 1), where D1 is the overall depth of the primary ontology and D2 is the overall depth of the secondary ontology. Their rationale is that the semantic distance between the concepts in the secondary ontology must be converted into the scale of the primary ontology. The other case occurs when the two concepts c1 and c2 are in different secondary ontologies and neither concept exists in the primary ontology.
One of the secondary ontologies temporarily acts as the primary ontology. Their discussion of this case is not clear.
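The bridge-concept case of equations (4.5) to (4.7) can be sketched as follows for c1 in the primary ontology and c2 in a secondary ontology. This is only an illustration: it assumes that D in (4.5) is the depth D1 of the primary ontology, that the logarithm is natural, and that each bridge is supplied as a precomputed triple; the function names and the sample numbers are hypothetical.

```python
import math


def path_rate(d1, d2):
    """Scaling ratio (2*D1 - 1) / (2*D2 - 1) between primary and secondary depths."""
    return (2 * d1 - 1) / (2 * d2 - 1)


def cross_sem_dist(bridges, d1, d2, alpha=1.0, beta=1.0, k=1.0):
    """Minimum SemDist_i over all bridge concepts, following (4.5) to (4.7).

    Each bridge is a triple:
    (distance c1 -> bridge, distance c2 -> bridge, depth of LCS(c1, bridge)).
    """
    rate = path_rate(d1, d2)
    best = math.inf
    for dist_c1, dist_c2, lcs_depth in bridges:
        path = dist_c1 + rate * dist_c2   # secondary-side distance is rescaled
        cspec = d1 - lcs_depth            # assumption: D is the primary depth D1
        value = math.log((path - 1) ** alpha * cspec ** beta + k)
        best = min(best, value)
    return best


# Hypothetical setting: primary depth 10, secondary depth 5, two bridge concepts.
bridges = [(2, 1, 4), (3, 2, 8)]
print(cross_sem_dist(bridges, d1=10, d2=5))
```

In this example the second bridge gives the smaller distance even though its path is longer, because its lowest common subsumer with c1 lies deeper in the primary ontology.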
They recommend that, when calculating SemDist between concepts in multiple ontologies, the ontology with the greatest granularity be selected as the primary ontology. If a concept occurs in multiple secondary ontologies, they recommend selecting the ontology that has the most overlap of concepts with the primary ontology.

The authors also develop a set of experiments using two vocabularies from the UMLS, SNOMED-CT and MeSH, together with the WordNet 2.0 ontology and several different datasets based on previous experiments that evaluate measures by their correlation with human judgments of similarity between concepts in the vocabulary. Their experiments use WordNet 2.0 as the primary ontology and MeSH and SNOMED-CT as the secondary ontologies. One aspect that is not clear is the results for two other measures. No explanation is given of how the results are calculated for the Leacock and Chodorow measure and the Wu and Palmer measure. These two measures are defined for a single ontology. It is not clear how they are adapted for multiple ontologies in order to produce the numbers provided in the tables.

4.2 Semantic Relatedness Measure Using Object Properties in an Ontology

In [Mazuel and Sabouret 2008] a semantic relatedness measure is proposed that makes use of the Hirst & St-Onge patterns for semantically correct paths [Hirst and St-Onge 1998] and the information-theoretic paradigm introduced in [Resnik 1995]. In all of the previous discussions of ontological similarity measures, the type of relationship used to link concepts to one another is the is-a (subsumption) relationship or the part-of relationship. In ontologies where other relationships exist between concepts, it might be the case that two concepts have low ontological similarity but are still highly related.
Although most measures focus on the hierarchical structuring relationships, Hirst and St-Onge propose a semantic relatedness measure that requires certain patterns or changes in direction to hold in order to calculate the semantic relatedness between
two concepts. In [Mazuel and Sabouret 2008] a relatedness measure is proposed that integrates the use of other kinds of links in determining path-based measures. In their discussion, there are some errors. For example, they state: "The first node-based similarity measure, proposed by Resnik in [Resnik 1995], is defined by the information content of the closest common parent (ccp) of the two concept c1 and c2." This statement is incorrect. The closeness of the common parent has nothing to do with the selected common parent. It is the most informative common ancestor, i.e., the one with the highest IC value, that should be selected.

The objective of the authors is to extend the assumption that two different hierarchical edges do not carry the same information content to non-hierarchical links. There are two situations: a single-relation path and a mixed-relation path. For a single-relation path, the Jiang & Conrath method is used if the path has only "is-a" (upward) and "includes" (downward) relations, although the authors state that the upward path distance has to be calculated separately from the downward path and the two added together. This approach is simply the same as

$$dist_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,IC(c_3) \tag{4.8}$$

They use the method of calculating IC given in [Seco et al. 2004], defined above as IC_ont(c). For paths that use relations which are not hierarchical, a static strength TC_X is associated with each relation of type X, and the path weight is calculated as:

$$W(path_X(x, y)) = TC_X \tag{4.9}$$

A mixed-relation path from concept c1 to concept c2 can be factorized into an ordered set of n single-relation sub-paths, and the single-relation path weights are then added together. They define the minimal factorization T_min(path(c1, c2)) as the factorization which minimizes
the value n. The weight of the mixed path (c1, c2) is then defined as the sum of the weights of all sub-paths of T_min. The final distance between two concepts is defined as (4.10) where the HSO(p) allows only paths that are semantically correct based on the rules of Hirst and St-Onge to be used. Since this is a distance measure, the authors convert it to a similarity measure by subtracting it from the greatest distance:

$$rel(c_1, c_2) = 2\,IC_{max} - dist(c_1, c_2) \tag{4.10}$$

Tests are implemented on the Miller & Charles data [Miller and Charles 1991] and the WordSimilarity-353 data set [ using the WordNet ontology (only the noun part, which is the standard approach). Their experiments show that their measure has a higher Pearson correlation with human similarity judgments than any of the Rada, Resnik, Lin, Jiang & Conrath, and Hirst & St-Onge measures.

4.3 Plethora of Similarity Measures in Bioinformatics

In [Pesquita et al 2009] an overview of the wide variety of semantic similarity measures is presented. In this section, some of these measures are presented in order to illustrate the proliferation of such measures and to argue for the development of a framework to be used in comparing such measures mathematically and experimentally without using correlation with some gold standard of similarity assessment. In this paper, the primary ontology that these similarity measures have been used with is the Gene Ontology. The performance of the semantic similarity measures is assessed on how well they can be used to determine the similarity of genes or gene products that are annotated using GO terms. The similarity between two genes or gene products is determined as an aggregation of the similarities between their sets of GO term annotations. The performance of the ontological similarity measures is then
More informationEfficient Reassembling of Graphs, Part 1: The Linear Case
Efficient Reassembling of Graphs, Part 1: The Linear Case Assaf Kfoury Boston University Saber Mirzaei Boston University Abstract The reassembling of a simple connected graph G = (V, E) is an abstraction
More informationAn Approach to Constructing Good Two-level Orthogonal Factorial Designs with Large Run Sizes
An Approach to Constructing Good Two-level Orthogonal Factorial Designs with Large Run Sizes by Chenlu Shi B.Sc. (Hons.), St. Francis Xavier University, 013 Project Submitted in Partial Fulfillment of
More informationHYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH
HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi
More informationAn Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees
An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute
More informationClassification Based on Logical Concept Analysis
Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.
More informationMatroid Secretary for Regular and Decomposable Matroids
Matroid Secretary for Regular and Decomposable Matroids Michael Dinitz Weizmann Institute of Science mdinitz@cs.cmu.edu Guy Kortsarz Rutgers University, Camden guyk@camden.rutgers.edu Abstract In the matroid
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationTwo-sample Categorical data: Testing
Two-sample Categorical data: Testing Patrick Breheny April 1 Patrick Breheny Introduction to Biostatistics (171:161) 1/28 Separate vs. paired samples Despite the fact that paired samples usually offer
More informationWorkshop: Biosystematics
Workshop: Biosystematics by Julian Lee (revised by D. Krempels) Biosystematics (sometimes called simply "systematics") is that biological sub-discipline that is concerned with the theory and practice of
More informationClustering. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden
Clustering Genome 373 Genomic Informatics Elhanan Borenstein Some slides adapted from Jacques van Helden The clustering problem The goal of gene clustering process is to partition the genes into distinct
More informationCausality II: How does causal inference fit into public health and what it is the role of statistics?
Causality II: How does causal inference fit into public health and what it is the role of statistics? Statistics for Psychosocial Research II November 13, 2006 1 Outline Potential Outcomes / Counterfactual
More informationComputability of Heyting algebras and. Distributive Lattices
Computability of Heyting algebras and Distributive Lattices Amy Turlington, Ph.D. University of Connecticut, 2010 Distributive lattices are studied from the viewpoint of effective algebra. In particular,
More informationAuthor Entropy vs. File Size in the GNOME Suite of Applications
Brigham Young University BYU ScholarsArchive All Faculty Publications 2009-01-01 Author Entropy vs. File Size in the GNOME Suite of Applications Jason R. Casebolt caseb106@gmail.com Daniel P. Delorey routey@deloreyfamily.org
More informationGene Ontology. Shifra Ben-Dor. Weizmann Institute of Science
Gene Ontology Shifra Ben-Dor Weizmann Institute of Science Outline of Session What is GO (Gene Ontology)? What tools do we use to work with it? Combination of GO with other analyses What is Ontology? 1700s
More informationCHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION
CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION 4.1 Overview This chapter contains the description about the data that is used in this research. In this research time series data is used. A time
More informationMATH2206 Prob Stat/20.Jan Weekly Review 1-2
MATH2206 Prob Stat/20.Jan.2017 Weekly Review 1-2 This week I explained the idea behind the formula of the well-known statistic standard deviation so that it is clear now why it is a measure of dispersion
More informationToward a Proof of the Chain Rule
Toward a Proof of the Chain Rule Joe Gerhardinger, jgerhardinger@nda.org, Notre Dame Academy, Toledo, OH Abstract The proof of the chain rule from calculus is usually omitted from a beginning calculus
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph
More informationREX - A TOOL FOR DISCOVERING EVOLUTION TRENDS
REX - A TOOL FOR DISCOVERING EVOLUTION TRENDS IN ONTOLOGY REGIONS VICTOR CHRISTEN, ANIKA GROSS, MICHAEL HARTUNG 18 TH JULY 2014, DILS, LISBOA 1 ONTOLOGY EVOLUTION Heavy usage of ontologies in the life
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element
More informationRECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION
RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION MAGNUS BORDEWICH, KATHARINA T. HUBER, VINCENT MOULTON, AND CHARLES SEMPLE Abstract. Phylogenetic networks are a type of leaf-labelled,
More informationEquality of P-partition Generating Functions
Bucknell University Bucknell Digital Commons Honors Theses Student Theses 2011 Equality of P-partition Generating Functions Ryan Ward Bucknell University Follow this and additional works at: https://digitalcommons.bucknell.edu/honors_theses
More informationConnectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).
Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.
More informationDr. Amira A. AL-Hosary
Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological
More informationPart III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval
Inf1-DA 2010 20 III: 28 / 89 Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval Statistical Analysis of Data: III.2 Data scales and summary statistics III.3 Hypothesis
More informationPath Graphs and PR-trees. Steven Chaplick
Path Graphs and PR-trees by Steven Chaplick A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto Copyright
More informationpursues interdisciplinary long-term research in Spatial Cognition. Particular emphasis is given to:
The Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition: Reasoning, Action, Interaction at the Universities of Bremen and Freiburg, Germany pursues interdisciplinary long-term research
More informationPattern Popularity in 132-Avoiding Permutations
Pattern Popularity in 132-Avoiding Permutations The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Rudolph,
More informationLeveraging Data Relationships to Resolve Conflicts from Disparate Data Sources. Romila Pradhan, Walid G. Aref, Sunil Prabhakar
Leveraging Data Relationships to Resolve Conflicts from Disparate Data Sources Romila Pradhan, Walid G. Aref, Sunil Prabhakar Fusing data from multiple sources Data Item S 1 S 2 S 3 S 4 S 5 Basera 745
More informationExamining the accuracy of the normal approximation to the poisson random variable
Eastern Michigan University DigitalCommons@EMU Master's Theses and Doctoral Dissertations Master's Theses, and Doctoral Dissertations, and Graduate Capstone Projects 2009 Examining the accuracy of the
More informationClustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden
Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Gene expression profiling A quick review Which molecular processes/functions
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationClustering & microarray technology
Clustering & microarray technology A large scale way to measure gene expression levels. Thanks to Kevin Wayne, Matt Hibbs, & SMD for a few of the slides 1 Why is expression important? Proteins Gene Expression
More informationarxiv: v1 [cs.ds] 3 Feb 2018
A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based
More informationX X (2) X Pr(X = x θ) (3)
Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree
More informationTHESIS. Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University
The Hasse-Minkowski Theorem in Two and Three Variables THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University By
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationProbabilistic Graphical Models
Probabilistic Graphical Models David Sontag New York University Lecture 4, February 16, 2012 David Sontag (NYU) Graphical Models Lecture 4, February 16, 2012 1 / 27 Undirected graphical models Reminder
More informationLecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events
Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek
More informationGraphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence
Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence General overview Introduction Directed acyclic graphs (DAGs) and conditional independence DAGs and causal effects
More informationAmira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut
Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological
More informationCS1800: Mathematical Induction. Professor Kevin Gold
CS1800: Mathematical Induction Professor Kevin Gold Induction: Used to Prove Patterns Just Keep Going For an algorithm, we may want to prove that it just keeps working, no matter how big the input size
More information4CitySemantics. GIS-Semantic Tool for Urban Intervention Areas
4CitySemantics GIS-Semantic Tool for Urban Intervention Areas Nuno MONTENEGRO 1 ; Jorge GOMES 2 ; Paulo URBANO 2 José P. DUARTE 1 1 Faculdade de Arquitectura da Universidade Técnica de Lisboa, Rua Sá Nogueira,
More informationUnderstanding Interlinked Data
Understanding Interlinked Data Visualising, Exploring, and Analysing Ontologies Olaf Noppens and Thorsten Liebig (Ulm University, Germany olaf.noppens@uni-ulm.de, thorsten.liebig@uni-ulm.de) Abstract Companies
More informationRow and Column Distributions of Letter Matrices
College of William and Mary W&M ScholarWorks Undergraduate Honors Theses Theses, Dissertations, & Master Projects 5-2016 Row and Column Distributions of Letter Matrices Xiaonan Hu College of William and
More information