Comparison between ontology distances (preliminary results)

Similar documents
Comparison between ontology distances (preliminary results)

Ontology similarity in the alignment space

Uncertain Reasoning for Creating Ontology Mapping on the Semantic Web. Miklos Nagy, Maria Vargas-Vera and Enrico Motta

Boolean and Vector Space Retrieval Models

Combining Sum-Product Network and Noisy-Or Model for Ontology Matching

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Information Retrieval and Web Search

Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium

Information Retrieval

A conceptualization is a map from the problem domain into the representation. A conceptualization specifies:

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

Ontology-Based News Recommendation

Exploiting WordNet as Background Knowledge

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

The OntoNL Semantic Relatedness Measure for OWL Ontologies

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Polionto: Ontology reuse with Automatic Text Extraction from Political Documents

Chemical Space: Modeling Exploration & Understanding

Extracting Modules from Ontologies: A Logic-based Approach

Knowledge Sharing. A conceptualization is a map from the problem domain into the representation. A conceptualization specifies:

An OWL Ontology for Quantum Mechanics

UNIVERSITY OF CALGARY. The Automatic Grouping of Sensor Data Layers

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

Web Ontology Language (OWL)

On Algebraic Spectrum of Ontology Evaluation

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

An Ontology and Concept Lattice Based Inexact Matching Method for Service Discovery

SemaDrift: A Protégé Plugin for Measuring Semantic Drift in Ontologies

SEMANTIC ALIGNMENT OF DOCUMENTS WITH 3D CITY MODELS

Publishing Value Added EO Products as Linked Data

John Pavlopoulos and Ion Androutsopoulos NLP Group, Department of Informatics Athens University of Economics and Business, Greece

Scoring, Term Weighting and the Vector Space

A Survey of Temporal Knowledge Representations

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

1 Information retrieval fundamentals

Maschinelle Sprachverarbeitung

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

CAIM: Cerca i Anàlisi d Informació Massiva

Inference in Probabilistic Ontologies with Attributive Concept Descriptions and Nominals URSW 2008

SPARQL Rewriting for Query Mediation over Mapped Ontologies

An Analysis on Consensus Measures in Group Decision Making

Information Retrieval Basic IR models. Luca Bondi

Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data

Theoretical approach to urban ontology: a contribution from urban system analysis

Data-Driven Logical Reasoning

2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes

Investigation of Latent Semantic Analysis for Clustering of Czech News Articles

Improved Algorithms for Module Extraction and Atomic Decomposition

Semantic Similarity from Corpora - Latent Semantic Analysis

Using C-OWL for the Alignment and Merging of Medical Ontologies

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Journal of Theoretical and Applied Information Technology 30 th November Vol.96. No ongoing JATIT & LLS QUERYING RDF DATA

Relaxed Precision and Recall for Ontology Matching

Ontologies for nanotechnology. Colin Batchelor

OWL Semantics COMP Sean Bechhofer Uli Sattler

An Algebra of Qualitative Taxonomical Relations for Ontology Alignments

Machine Learning for natural language processing

Information Retrieval

SPARQL Formalization

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Research Article Anatomy Ontology Matching Using Markov Logic Networks

DESIGNING A CARTOGRAPHIC ONTOLOGY FOR USE WITH EXPERT SYSTEMS

Resource Description Framework (RDF) A basis for knowledge representation on the Web

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017

Key Words: geospatial ontologies, formal concept analysis, semantic integration, multi-scale, multi-context.

Estimating query rewriting quality over LOD

DISTRIBUTIONAL SEMANTICS

Generating Sentences by Editing Prototypes

Lexical Analysis. DFA Minimization & Equivalence to Regular Expressions

Translating Ontologies from Predicate-based to Frame-based Languages

A HYBRID SEMANTIC SIMILARITY MEASURING APPROACH FOR ANNOTATING WSDL DOCUMENTS WITH ONTOLOGY CONCEPTS. Received February 2017; revised May 2017

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

vector space retrieval many slides courtesy James Amherst

Minimal Coverage for Ontology Signatures

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.

Lecture 2: Axiomatic semantics

Ontological Modelling in Wikidata

Towards Linkset Quality for Complementing SKOS Thesauri

Lecture 2 August 31, 2007

Exploring Large Digital Library Collections using a Map-based Visualisation

Spatial Data on the Web Working Group. Geospatial RDA P9 Barcelona, 5 April 2017

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

CSE 158 Lecture 10. Web Mining and Recommender Systems. Text mining Part 2

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Francisco M. Couto Mário J. Silva Pedro Coutinho

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

On Politics and Argumentation

Toponym Disambiguation using Ontology-based Semantic Similarity

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Descriptive Data Summarization

University of Florida CISE department Gator Engineering. Clustering Part 1

Semantics, ontologies and escience for the Geosciences

Machine Learning. Ludovic Samper. September 1st, Antidot. Ludovic Samper (Antidot) Machine Learning September 1st, / 77

Outline Introduction Background Related Rl dw Works Proposed Approach Experiments and Results Conclusion

Annotation algebras for RDFS

Transcription:

Comparison between ontology distances (preliminary results) Jérôme David, Jérôme Euzenat ISWC 2008 1 / 17

Context A distributed and heterogenous environment: the semantic Web Several ontologies on the same domain Problem: how to facilitate the communication between programs using different ontologies?? software program, peer, etc. software program, peer, etc. 2 / 17

Context A distributed and heterogenous environment: the semantic Web Several ontologies on the same domain Problem: how to facilitate the communication between programs using different ontologies? alignment = mediator software program, peer, etc. software program, peer, etc. 2 / 17

Context How to deal with many ontologies at Web scale? A solution: Distances between ontologies In peer-to-peer systems: find a peer using a close ontology In ontology engineering: find related ontologies for facilitating ontologies reuse etc. Objectives: Review some distances between ontologies distances issued from ontology matching Make a preliminary evaluation of these measures accuracy vs. speed 3 / 17

Various kinds of ontology distances Two kinds of ontology distances evaluated: Vectorial distances: ontologies are viewed as bags of terms 1. which weight scheme to use? 2. which vector distance to use? Entity-based distances: the distance between ontologies is function of the distance between their entities 1. entity distances: structural or/and lexical? 2. aggregating entity distance values: which collection distance to use? 4 / 17

Vectorial ontology distances RedWine Ontology Ontology vector... Médoc individual MyChateauMargauxBottle Château Margaux label comment ProducedIn GrapeVariety Producer Cantenac Médoc Ṁédoc wines are french red wine produced in the Bordeaux wine region. BordeauxArea Cabarnet Sauvignon Vintage Appellation Name City Merlot Margaux 1998 ChateauMargaux wine red médoc french bordeaux cabarnet sauvignon merlot margaux chateau cantenac 1998 grape appelation producer... 3 2 2 1 1 1 1... Vector measures Cosine Jaccard (Tanimoto index) Possible weights Boolean Frequency TF.IDF 5 / 17

Entity distances Label-based distance: String-distance between entity annotations : Jaro-Winkler Minimum Weight pairing + arithmetic mean Structural distances: OLA similarity Structural iterative similarity for OWL ontologies Triple-based iterative similarity Structural iterative similarity for RDF graphs Problem: How to aggregate entity measures values? concepts of o entity distance values concepts of o' 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3? ontology distance value? 6 / 17

Collection measures Goal: compute ontology measure from concept measure values. Collection measures used Average linkage : the arithmetic mean of concept measure values. concepts of o' concepts of o 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3 mean = 0.5 7 / 17

Collection measures Goal: compute ontology measure from concept measure values. Collection measures used Average linkage : the arithmetic mean of concept measure values. Hausdorff : Hausdorff (o, o ) = max(max e o min e o δ K (e, e ), max e o min e o δ K (e, e )) concepts of o' concepts of o 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3 7 / 17

Collection measures Goal: compute ontology measure from concept measure values. Collection measures used Average linkage : the arithmetic mean of concept measure values. Hausdorff : Hausdorff (o, o ) = max(max e o min e o δ K (e, e ), max e o min e o δ K (e, e )) concepts of o' concepts of o 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3 7 / 17

Collection measures Goal: compute ontology measure from concept measure values. Collection measures used Average linkage : the arithmetic mean of concept measure values. Hausdorff : Hausdorff (o, o ) = max(max e o min e o δ K (e, e ), max e o min e o δ K (e, e )) concepts of o' concepts of o 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3 7 / 17

Collection measures Goal: compute ontology measure from concept measure values. Collection measures used Average linkage : the arithmetic mean of concept measure values. Hausdorff : Hausdorff (o, o ) = max(max e o min e o δ K (e, e ), max e o min e o δ K (e, e )) Minimum Weight Maximum Graph Matching distance: concepts of o' concepts of o 0.1 0.5 0.9 0.4 1 0.7 0.8 0.7 0.3 mean = 0.4 7 / 17

Selected Measures 12 measures evaluated: 3 vectorial measures: Jaccard: the most basic measure (proportion of common terms) Cosine + TF weights Cosine + TF IDF weights 3 entity measures: EntityLexicalMeasure TripleBasedEntitySim OLAEntitySim 9 entity based measures combined with 3 collections measures: AverageLinkage Hausdorff MWMGM 8 / 17

Benchmark suite OAEI benchmark test set: 1 reference ontology 101 and altered ontologies 6 kinds of alterations: Name: suppressed, randomized, synonyms, convention, other language Comment: suppressed, other language Specialization hierarchy: suppressed, expanded, flattened Instances: suppressed Properties: suppressed, discarded restrictions Classes: split, flattened Order between alterations Example on name: {suppressed, randomized} < { synonym, convention, other language } ontology with synonym names is closer to the reference one than ontology with no names 9 / 17

Benchmark suite Induced proximity order: 101 is closer to 225 than 228 103_101_104 207 225 230 231 distance 204 205 203 206 223 201 208 209 210 222 224 228 221 252 238 240 202 251 249 250 248 237 239 236 232 233 259 261 247 258 260 257 253 254 246 241 266 265 262 10 / 17

Benchmark suite Induced proximity order (with partial alterations): 11 / 17

Tests We performed 3 different tests: Order comparison on benchmark test set check if the similarity orders are compatible with the order induced by alterations For each o, o check if 101 < o < o δ(101, o) < δ(101, o ) - Test set is biased towards 1-1 matching: some measure favored 12 / 17

Tests We performed 3 different tests: Order comparison on benchmark test set check if the similarity orders are compatible with the order induced by alterations For each o, o check if 101 < o < o δ(101, o) < δ(101, o ) - Test set is biased towards 1-1 matching: some measure favored Different cardinality matching compare with related but different ontologies compare with unrelated ontologies 12 / 17

Tests We performed 3 different tests: Order comparison on benchmark test set check if the similarity orders are compatible with the order induced by alterations For each o, o check if 101 < o < o δ(101, o) < δ(101, o ) - Test set is biased towards 1-1 matching: some measure favored Different cardinality matching compare with related but different ontologies compare with unrelated ontologies Time consumption test compare the time consumption of evaluated measures 12 / 17

Order comparison Proportion of o, o satisfying the induced order: Measure Tests Passed (ratio) Original Enhanced MWMGM (EntityLexicalMeasure) 0.53 0.72 Hausdorff (EntityLexicalMeasure) NaN NaN AverageLinkage (EntityLexicalMeasure) 0.44 0.31 MWMGM (OLAEntitySim) 0.75 0.78 Hausdorff (OLAEntitySim) 0.75 0.65 AverageLinkage (OLAEntitySim) 0.79 0.74 MWMGM (TripleBasedEntitySim) 0.86 0.92 Hausdorff (TripleBasedEntitySim) 0.86 0.89 AverageLinkage (TripleBasedEntitySim) 0.82 0.91 CosineVM (TF) 0.82 0.92 CosineVM (TFIDF) 0.57 0.81 JaccardVM (TF) 0.71 0.87 best entity measure : TripleBasedEntitySim 13 / 17

Order comparison Proportion of o, o satisfying the induced order: Measure Tests Passed (ratio) Original Enhanced MWMGM (EntityLexicalMeasure) 0.53 0.72 Hausdorff (EntityLexicalMeasure) NaN NaN AverageLinkage (EntityLexicalMeasure) 0.44 0.31 MWMGM (OLAEntitySim) 0.75 0.78 Hausdorff (OLAEntitySim) 0.75 0.65 AverageLinkage (OLAEntitySim) 0.79 0.74 MWMGM (TripleBasedEntitySim) 0.86 0.92 Hausdorff (TripleBasedEntitySim) 0.86 0.89 AverageLinkage (TripleBasedEntitySim) 0.82 0.91 CosineVM (TF) 0.82 0.92 CosineVM (TFIDF) 0.57 0.81 JaccardVM (TF) 0.71 0.87 best entity measure : TripleBasedEntitySim best collection measure : MWMGM 13 / 17

Order comparison Proportion of o, o satisfying the induced order: Measure Tests Passed (ratio) Original Enhanced MWMGM (EntityLexicalMeasure) 0.53 0.72 Hausdorff (EntityLexicalMeasure) NaN NaN AverageLinkage (EntityLexicalMeasure) 0.44 0.31 MWMGM (OLAEntitySim) 0.75 0.78 Hausdorff (OLAEntitySim) 0.75 0.65 AverageLinkage (OLAEntitySim) 0.79 0.74 MWMGM (TripleBasedEntitySim) 0.86 0.92 Hausdorff (TripleBasedEntitySim) 0.86 0.89 AverageLinkage (TripleBasedEntitySim) 0.82 0.91 CosineVM (TF) 0.82 0.92 CosineVM (TFIDF) 0.57 0.81 JaccardVM (TF) 0.71 0.87 best entity measure : TripleBasedEntitySim best collection measure : MWMGM best vectorial measure : Cosine with TF weights 13 / 17

Order comparison Proportion of o, o satisfying the induced order: Measure Tests Passed (ratio) Original Enhanced MWMGM (EntityLexicalMeasure) 0.53 0.72 Hausdorff (EntityLexicalMeasure) NaN NaN AverageLinkage (EntityLexicalMeasure) 0.44 0.31 MWMGM (OLAEntitySim) 0.75 0.78 Hausdorff (OLAEntitySim) 0.75 0.65 AverageLinkage (OLAEntitySim) 0.79 0.74 MWMGM (TripleBasedEntitySim) 0.86 0.92 Hausdorff (TripleBasedEntitySim) 0.86 0.89 AverageLinkage (TripleBasedEntitySim) 0.82 0.91 CosineVM (TF) 0.82 0.92 CosineVM (TFIDF) 0.57 0.81 JaccardVM (TF) 0.71 0.87 best entity measure : TripleBasedEntitySim best collection measure : MWMGM best vectorial measure : Cosine with TF weights Best results: TripleBasedEntitySim with MWMGM Cosine with TF weights: basic measure but works well Hausdorff works only with TripleBasedEntitySim Is MWMGM favored by the 1-1 nature of this test? 13 / 17

Different cardinality matching Objective: Ensure that MWMGM is still the best measure when we use different cardinality ontologies Test set: A reference ontology: 101 Highly altered ontologies: 2xx Different but similar ontologies: 3xx Irrelevant ontologies: conference test set (could share some vocabulary with benchmark ontologies) Expected results: δ(101, 2xx) < δ(101, 3xx) < δ(101, conference) 14 / 17

Different cardinality matching Results: MWMGM still performs best, vectorial measures fail when ontologies are lexically altered Main misorderings: MWMGM Hausdorff : only one altered (250) with similar (3xx) and conference ontologies some similar (3xx) with conference ontologies with TripleBasedEntitySim: only 250 with similar (3xx) and a conference ontology with OLAEntitySim: 228, 248, 250, 251, 252 with 304 and conference ontologies AverageLinkage with LexicalEntitySim: does not work highly altered (2xx) and similar (3xx) ontologies Vectorial measures many altered (248, 250, 251, 252) ontologies with similar (3xx) ontologies 15 / 17

Time consumption Are measures useable in real-time applications? Vectorial measures are more time efficient Time consumption on benchmark test: Measure Total time Average time (s) per similarity value (s) MWMGM(EntityLexicalMeasure) 558 0.46 MWMGM (OLAEntitySim) 39 074 31.9 MWMGM (TripleBasedEntitySim) 7 950 6.49 Hausdorff (EntityLexicalMeasure) 451 0.37 Hausdorff (OLAEntitySim) 38 912 31.76 Hausdorff (TripleBasedEntitySim) 7 410 6.05 AverageLinkage (EntityLexicalMeasure) 444 0.36 AverageLinkage (OLAEntitySim) 38 995 31.83 AverageLinkage (TripleBasedEntitySim) 7 671 6.26 CosineVM (TF) 101 0.08 CosineVM (TFIDF) 102 0.08 JaccardVM (TF) 101 0.08 16 / 17

Time consumption Are measures useable in real-time applications? Vectorial measures are more time efficient Not significant differences between MWMGM (N 3 ), Hausdorff and AverageLinkage(N 2 ) Time consumption on benchmark test: Measure Total time Average time (s) per similarity value (s) MWMGM(EntityLexicalMeasure) 558 0.46 MWMGM (OLAEntitySim) 39 074 31.9 MWMGM (TripleBasedEntitySim) 7 950 6.49 Hausdorff (EntityLexicalMeasure) 451 0.37 Hausdorff (OLAEntitySim) 38 912 31.76 Hausdorff (TripleBasedEntitySim) 7 410 6.05 AverageLinkage (EntityLexicalMeasure) 444 0.36 AverageLinkage (OLAEntitySim) 38 995 31.83 AverageLinkage (TripleBasedEntitySim) 7 671 6.26 CosineVM (TF) 101 0.08 CosineVM (TFIDF) 102 0.08 JaccardVM (TF) 101 0.08 16 / 17

Time consumption Are measures useable in real-time applications? Vectorial measures are more time efficient Not significant differences between MWMGM (N 3 ), Hausdorff and AverageLinkage(N 2 ) Structural measures are not really useable in real-time applications Time consumption on benchmark test: Measure Total time Average time (s) per similarity value (s) MWMGM(EntityLexicalMeasure) 558 0.46 MWMGM (OLAEntitySim) 39 074 31.9 MWMGM (TripleBasedEntitySim) 7 950 6.49 Hausdorff (EntityLexicalMeasure) 451 0.37 Hausdorff (OLAEntitySim) 38 912 31.76 Hausdorff (TripleBasedEntitySim) 7 410 6.05 AverageLinkage (EntityLexicalMeasure) 444 0.36 AverageLinkage (OLAEntitySim) 38 995 31.83 AverageLinkage (TripleBasedEntitySim) 7 671 6.26 CosineVM (TF) 101 0.08 CosineVM (TFIDF) 102 0.08 JaccardVM (TF) 101 0.08 16 / 17

Conclusion Why use complex measures when basic measures performs well? for discriminating highly related ontologies More precise test set is needed Characterize confidence intervals Is there a threshold from which returned values are always correct? Other ontology distances measures: alignment based 17 / 17