Conceptual Similarity: Why, Where, How

Conceptual Similarity: Why, Where, How Michalis Sfakakis Laboratory on Digital Libraries & Electronic Publishing, Department of Archives and Library Sciences, Ionian University, Greece First Workshop on Digital Information Management, March 30-31, 2011 Corfu, Greece

Do we need similarity? Are the following objects similar? (Similarity, SIMILARITY) As character sequences, no! How do they differ? As character sequences compared case-insensitively, yes! As English words, yes! They are the same word: the same definition, written differently.

Contents: Introduction; Disciplines; How we measure similarity; Focus on Ontology Learning evaluation

Exploring similarity, more cases: What about the similarity of the objects (1, a)? The first object is the number one and the second is the first letter of the English alphabet. Therefore, as the first is a number and the second is a letter, they are different! But conceptually, when both represent an order, e.g. a chapter or a paragraph number, they both denote the first item of a list, the first chapter, paragraph, etc. Therefore, they could be considered similar!

Results for an Information Need How similar are the Results? Which one to select?

Comparing Concepts: again, how similar are the following objects? (Disease, Illness) As English words, or as character sequences, they are not similar! How do they differ? As synonymous terms in a Thesaurus, they both represent the same concept (they are related by the equivalence relationship).

Comparing Hierarchies: How similar is the node car from the left hierarchy to the node auto from the right hierarchy? How similar are the nodes van from the two hierarchies? How similar are the two hierarchies as a whole? * [Dellschaft and Staab, 2006]

So, what is similarity? Similarity is a context-dependent concept. Merriam-Webster's Learner's Dictionary defines similarity as*: "a quality that makes one person or thing like another", and similar as "having characteristics in common". Therefore, the context and the characteristics in common are required in order to specify and measure similarity. * http://www.learnersdictionary.com/search/similarity

Where the concept of similarity is encountered: Similarity is a context-dependent concept. It appears in Machine Learning; Ontology Learning; Schema & Ontology Matching and Mapping; Clustering; IR; and in any evaluation concerning the results of a pattern recognition algorithm. It is a vital part of Semantic Web development.

Precision & Recall in IR, measuring similarity between answers: Let C be the result set for a query (the retrieved documents, i.e. the Computed set). We also need to know the correct results for the query (all the relevant documents, the Reference set). Precision is the fraction of retrieved documents that are relevant to the query. Recall is the fraction of the relevant documents that are successfully retrieved. Wikipedia: http://en.wikipedia.org/wiki/precision_and_recall

Precision & Recall, a way to measure similarity: Precision & Recall are two widely used metrics for evaluating the correctness of a pattern recognition algorithm. Recall and Precision depend on the outcome of the algorithm (the retrieved set) and its relation to the relevant and the non-relevant patterns; the more correct results, the better. (Figure from Wikipedia: http://en.wikipedia.org/wiki/precision_and_recall)

Precision & Recall, once more: Let D be the whole document collection, R the set of relevant documents (Reference) and C the set of retrieved documents (Computed). Precision P = |R ∩ C| / |C|. Recall R = |R ∩ C| / |R|. True Positives TP = R ∩ C; True Negatives TN = D \ (R ∪ C); False Negatives FN = R \ C; False Positives FP = C \ R.
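
To make the set relations above concrete, here is a minimal Python sketch; the document collection and the relevant/retrieved sets are invented for illustration only.

```python
# Minimal sketch of the set-based definitions above (illustrative data only).
D = set(range(1, 21))          # D: the whole document collection
R = {1, 2, 3, 4, 5, 6}         # R: the relevant documents (Reference set)
C = {4, 5, 6, 7, 8}            # C: the retrieved documents (Computed set)

TP = R & C                     # true positives:  relevant and retrieved
FN = R - C                     # false negatives: relevant but not retrieved
FP = C - R                     # false positives: retrieved but not relevant
TN = D - (R | C)               # true negatives:  neither relevant nor retrieved

precision = len(TP) / len(C)   # fraction of retrieved documents that are relevant
recall = len(TP) / len(R)      # fraction of relevant documents that were retrieved

print(f"P = {precision:.2f}, R = {recall:.2f}")   # P = 0.60, R = 0.50
```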

Overall evaluation, combining Precision & Recall: Given Precision & Recall, the F-measure combines them into an overall evaluation. Balanced F-measure (P & R are evenly weighted): F1 = 2*(P*R)/(P+R). Weighted F-measure: Fb = (1+b^2)*(P*R)/(b^2*P+R), b > 0. F2 (b=2) weights recall twice as much as precision; F0.5 (b=0.5) weights precision twice as much as recall.
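
A small sketch of the balanced and weighted F-measure formulas above; the precision and recall values are simply the ones produced by the previous example.

```python
def f_measure(p: float, r: float, b: float = 1.0) -> float:
    """Weighted F-measure: F_b = (1 + b^2) * P * R / (b^2 * P + R)."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + b * b) * p * r / (b * b * p + r)

p, r = 0.60, 0.50
print(f_measure(p, r))          # F1   ~ 0.545 (P and R evenly weighted)
print(f_measure(p, r, b=2))     # F2   ~ 0.517 (recall weighted twice as much)
print(f_measure(p, r, b=0.5))   # F0.5 ~ 0.577 (precision weighted twice as much)
```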

Measuring Similarity, Comparing two Ontologies: A simplified definition of a core ontology*: The structure O := (C, root, ≤_C) is called a core ontology. C is a set of concept identifiers and root is a designated root concept for the partial order ≤_C on C. This partial order is called the concept hierarchy or taxonomy. The equation ∀c ∈ C: c ≤_C root holds for this concept hierarchy. Levels of comparison: Lexical, how terms are used to convey meanings; Conceptual, which conceptual relations exist between terms. * [Dellschaft and Staab, 2006]
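
As an illustration, a core ontology O := (C, root, ≤_C) can be sketched as a child-to-parent map with a derived partial order; the concept names below mirror the reference hierarchy used in the later slides, while the encoding itself is only an assumption made for illustration.

```python
# A core ontology O := (C, root, <=_C) sketched as a child -> parent map.
parent = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}
root = "root"
concepts = {root} | set(parent)          # C: the set of concept identifiers

def leq(c1: str, c2: str) -> bool:
    """c1 <=_C c2 : c2 is c1 itself or one of its ancestors."""
    while True:
        if c1 == c2:
            return True
        if c1 == root:
            return False
        c1 = parent[c1]

# Every concept is below the designated root: for all c in C, c <=_C root.
assert all(leq(c, root) for c in concepts)
print(leq("van", "car"), leq("car", "van"))   # True False
```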

Gold Standard based Evaluation of Ontology Learning: Given a pre-defined ontology O_R, the so-called Gold Standard or Reference ontology, compare the learned (Computed) ontology O_C with the Gold Standard.

Measuring Similarity, Lexical Comparison Level: LP, LR. Lexical Precision & Lexical Recall: LP(O_C, O_R) = |C_C ∩ C_R| / |C_C|, LR(O_C, O_R) = |C_C ∩ C_R| / |C_R|. Lexical precision and recall reflect how well the learned lexical terms C_C cover the target domain C_R. For the above example, LP = 4/6 = 0.67 and LR = 4/5 = 0.8.
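
A minimal sketch of LP and LR; the term sets below are read off the example hierarchies used in the slides (|C_C| = 6, |C_R| = 5, overlap 4), so they reproduce the 0.67 and 0.8 figures.

```python
def lexical_precision(terms_c: set, terms_r: set) -> float:
    return len(terms_c & terms_r) / len(terms_c)

def lexical_recall(terms_c: set, terms_r: set) -> float:
    return len(terms_c & terms_r) / len(terms_r)

C_C = {"root", "bike", "van", "coupé", "auto", "BMX"}    # learned (computed) terms
C_R = {"root", "bike", "van", "coupé", "car"}            # gold standard (reference) terms

print(lexical_precision(C_C, C_R))   # 4/6 ~ 0.67
print(lexical_recall(C_C, C_R))      # 4/5 = 0.80
```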

Measuring Similarity, Lexical Comparison Level, asm: Average String Matching, using edit distance. The Levenshtein distance, the most common definition of edit distance, measures the minimum number of single-character insertions, deletions and substitutions required to transform one string into another. For example*, the Levenshtein distance between "kitten" and "sitting" is 3 (there is no way to do it with fewer than three edits): kitten -> sitten (substitution of 's' for 'k'); sitten -> sittin (substitution of 'i' for 'e'); sittin -> sitting (insertion of 'g' at the end). * Wikipedia: http://en.wikipedia.org/wiki/levenshtein_distance
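
A short dynamic-programming sketch of the Levenshtein distance described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))                # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                                # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))           # 3
```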

Measuring Similarity, Lexical Comparison Level, String Matching: The String Matching measure (SM), given two lexical entries L1, L2, weights the number of required changes against the length of the shorter string; 1 stands for a perfect match, 0 for a bad match. The Average SM is asymmetric: it determines the extent to which L1 (target) is covered by L2 (source). [Maedche and Staab, 2002]
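
The slide does not spell out the SM formula; a common formulation, assumed here after Maedche and Staab (2002), subtracts the edit distance from the length of the shorter string and normalises by that length, clipping at 0.

```python
from functools import lru_cache

def string_matching(l1: str, l2: str) -> float:
    """SM weighs the edit distance against the shorter string: 1 = perfect, 0 = bad match.
    Exact formulation assumed after Maedche & Staab (2002)."""
    @lru_cache(maxsize=None)
    def ed(i: int, j: int) -> int:                # Levenshtein distance of l1[:i], l2[:j]
        if i == 0 or j == 0:
            return i + j
        return min(ed(i - 1, j) + 1,                              # deletion
                   ed(i, j - 1) + 1,                              # insertion
                   ed(i - 1, j - 1) + (l1[i - 1] != l2[j - 1]))   # substitution/match
    shorter = min(len(l1), len(l2))
    return max(0.0, (shorter - ed(len(l1), len(l2))) / shorter)

print(string_matching("power", "power"))   # 1.0
print(string_matching("power", "tower"))   # 0.8 -- the "noise" mentioned on the next slide
```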

Measuring Similarity, Lexical Comparison Level, RelHit: RelHit actually expresses Lexical Precision. Compared to RelHit, average String Matching reduces the influence of string pseudo-differences (e.g. singular vs. plural forms), but it may also introduce some noise (e.g. power vs. tower).

Measuring Similarity, Conceptual Comparison Level: The conceptual level compares the semantic structure of the ontologies. Conceptual structures are constituted by hierarchies or by relations. How to compare two hierarchies? How do the positions of concepts influence the similarity of hierarchies? What measures to use?

Measuring Similarity, Conceptual Comparison Level: Local measures compare the positions of two concepts based on characteristic extracts from the concept hierarchies they belong to. Some characteristic extracts: Semantic Cotopy (sc): sc(c, O) = {c_i | c_i ∈ C ∧ (c_i ≤_C c ∨ c ≤_C c_i)}. Common Semantic Cotopy (csc): csc(c, O_1, O_2) = {c_i | c_i ∈ C_1 ∩ C_2 ∧ (c_i <_1 c ∨ c <_1 c_i)}.

Measuring Similarity, Conceptual Comparison Level, sc: Semantic Cotopy: sc(c, O) = {c_i | c_i ∈ C ∧ (c_i ≤_C c ∨ c ≤_C c_i)}. Semantic Cotopy examples: sc(root, O_R) = {root, bike, car, van, coupé}; sc(root, O_C) = {root, bike, auto, BMX, van, coupé}; sc(bike, O_R) = {root, bike}; sc(bike, O_C) = {root, bike, BMX}; sc(car, O_R) = {root, car, van, coupé}; sc(auto, O_C) = {root, auto, van, coupé}.
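
A minimal sketch computing the semantic cotopy on the two example hierarchies; the child-to-parent maps below are one way to encode the car/auto hierarchies read from the slides.

```python
# Child -> parent maps for the two example hierarchies (as read from the slides).
O_R = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}                    # reference
O_C = {"bike": "root", "BMX": "bike", "auto": "root", "van": "auto", "coupé": "auto"}  # computed

def ancestors(c, parent):
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def descendants(c, parent):
    return {x for x in parent if c in ancestors(x, parent)}

def sc(c, parent):
    """Semantic cotopy: the concept itself plus all of its sub- and super-concepts."""
    return {c} | ancestors(c, parent) | descendants(c, parent)

print(sc("root", O_R))   # -> {root, bike, car, van, coupé}
print(sc("bike", O_C))   # -> {root, bike, BMX}
print(sc("auto", O_C))   # -> {root, auto, van, coupé}
```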

Measuring Similarity, Conceptual Comparison Level, csc: Common Semantic Cotopy: csc(c, O_1, O_2) = {c_i | c_i ∈ C_1 ∩ C_2 ∧ (c_i <_1 c ∨ c <_1 c_i)}. Common Semantic Cotopy examples: C_R ∩ C_C = {root, bike, van, coupé}; csc(root, O_R, O_C) = {bike, van, coupé}; csc(root, O_C, O_R) = {bike, van, coupé}; csc(bike, O_R, O_C) = {root}; csc(bike, O_C, O_R) = {root}; csc(car, O_R, O_C) = {root, van, coupé}; csc(car, O_C, O_R) = ∅; csc(auto, O_C, O_R) = {root, van, coupé}; csc(auto, O_R, O_C) = ∅.
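
A corresponding sketch of the common semantic cotopy: only concepts shared by both ontologies are kept, and the strict order excludes the concept itself. Same assumed child-to-parent encoding as before.

```python
# Child -> parent encoding of the two example hierarchies.
O_R = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}
O_C = {"bike": "root", "BMX": "bike", "auto": "root", "van": "auto", "coupé": "auto"}

def strict_related(c1, c2, parent):
    """True iff c1 <_1 c2 or c2 <_1 c1 in the hierarchy given by `parent`."""
    def ancestors(c):
        out = set()
        while c in parent:
            c = parent[c]
            out.add(c)
        return out
    return c2 in ancestors(c1) or c1 in ancestors(c2)

def csc(c, parent1, parent2):
    """Common semantic cotopy of c w.r.t. O_1, restricted to concepts shared with O_2."""
    concepts1 = set(parent1) | set(parent1.values())
    concepts2 = set(parent2) | set(parent2.values())
    return {ci for ci in concepts1 & concepts2 if strict_related(ci, c, parent1)}

print(csc("root", O_R, O_C))   # -> {bike, van, coupé}
print(csc("car", O_R, O_C))    # -> {root, van, coupé}
print(csc("auto", O_C, O_R))   # -> {root, van, coupé}
```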

Measuring Similarity, Conceptual Comparison Level, local measures tp, tr: Local taxonomic precision using a characteristic extract ce, for a concept c_1 from O_C paired with a concept c_2 from O_R: tp_ce(c_1, c_2, O_C, O_R) = |ce(c_1, O_C) ∩ ce(c_2, O_R)| / |ce(c_1, O_C)|. Local taxonomic recall: tr_ce(c_1, c_2, O_C, O_R) = |ce(c_1, O_C) ∩ ce(c_2, O_R)| / |ce(c_2, O_R)|.

Measuring Similarity, Conceptual Comparison Level, local measures tp: Local taxonomic precision example using sc: sc(bike, O_R) = {root, bike}, sc(bike, O_C) = {root, bike, BMX}; tp_sc(bike, bike, O_C, O_R) = |{root, bike}| / |{root, bike, BMX}| = 2/3 = 0.67. [Maedche and Staab, 2002]

Measuring Similarity, Conceptual Comparison Level, local measures tp: Local taxonomic precision example using sc: sc(car, O_R) = {root, car, van, coupé}, sc(auto, O_C) = {root, auto, van, coupé}; tp_sc(auto, car, O_C, O_R) = |{root, van, coupé}| / |{root, auto, van, coupé}| = 3/4 = 0.75 (auto comes from O_C, car from O_R).
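
A sketch of the local taxonomic precision and recall with sc as the characteristic extract, reproducing the 2/3 and 3/4 figures above; the argument orientation follows the reconstruction given earlier (first concept from O_C, second from O_R).

```python
# Example hierarchies as child -> parent maps (reference O_R, computed O_C).
O_R = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}
O_C = {"bike": "root", "BMX": "bike", "auto": "root", "van": "auto", "coupé": "auto"}

def sc(c, parent):
    """Semantic cotopy: c plus all of its super- and sub-concepts."""
    def anc(x):
        out = set()
        while x in parent:
            x = parent[x]
            out.add(x)
        return out
    return {c} | anc(c) | {x for x in parent if c in anc(x)}

def tp_sc(c_c, c_r):
    """Local taxonomic precision of c_c in O_C against c_r in O_R (sc as extract)."""
    return len(sc(c_c, O_C) & sc(c_r, O_R)) / len(sc(c_c, O_C))

def tr_sc(c_c, c_r):
    """Local taxonomic recall of c_c in O_C against c_r in O_R (sc as extract)."""
    return len(sc(c_c, O_C) & sc(c_r, O_R)) / len(sc(c_r, O_R))

print(tp_sc("bike", "bike"))   # 2/3 ~ 0.67
print(tp_sc("auto", "car"))    # 3/4 = 0.75
```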

Measuring Similarity, Conceptual Comparison Level, comparing Hierarchies: Global Taxonomic Precision (TP).
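
The formula for the global measure did not survive the transcription; the sketch below assumes, after Dellschaft and Staab (2006), that the global taxonomic precision averages the local tp_sc(c, c, O_C, O_R) over the concepts shared by both ontologies (the paper also handles non-shared concepts, which is omitted here).

```python
# Simplified sketch: average the local taxonomic precision over the shared concepts.
O_R = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}
O_C = {"bike": "root", "BMX": "bike", "auto": "root", "van": "auto", "coupé": "auto"}

def sc(c, parent):
    def anc(x):
        out = set()
        while x in parent:
            x = parent[x]
            out.add(x)
        return out
    return {c} | anc(c) | {x for x in parent if c in anc(x)}

def global_tp(parent_c, parent_r):
    """Assumed aggregation: mean local tp over concepts present in both hierarchies."""
    shared = ((set(parent_c) | set(parent_c.values()))
              & (set(parent_r) | set(parent_r.values())))
    local = [len(sc(c, parent_c) & sc(c, parent_r)) / len(sc(c, parent_c)) for c in shared]
    return sum(local) / len(local)

print(round(global_tp(O_C, O_R), 2))   # 0.67 for the example hierarchies
```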

Measuring Similarity, Conceptual Comparison Level, Overall evaluation: again an F-measure, but now using the Global Taxonomic Precision (TP) and Global Taxonomic Recall (TR). Balanced Taxonomic F-measure (TP & TR are evenly weighted): TF1 = 2*(TP*TR)/(TP+TR). Weighted TF-measure: TFb = (1+b^2)*(TP*TR)/(b^2*TP+TR), b > 0. TF2 (b=2) weights recall twice as much as precision; TF0.5 (b=0.5) weights precision twice as much as recall.

Measuring Similarity, Conceptual Comparison Level, Taxonomic Overlap: The Global Taxonomic Overlap is based on the local taxonomic overlap (TO).
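
The slide only names the measure; the sketch below assumes, after Maedche and Staab (2002), that the local taxonomic overlap of a shared concept is the Jaccard coefficient of its semantic cotopies in the two ontologies, and that the global TO averages it over the shared concepts.

```python
# Sketch under the assumption that local TO(c) = |sc(c,O_R) ∩ sc(c,O_C)| / |sc(c,O_R) ∪ sc(c,O_C)|
# and that the global taxonomic overlap is the mean over shared concepts.
O_R = {"bike": "root", "car": "root", "van": "car", "coupé": "car"}
O_C = {"bike": "root", "BMX": "bike", "auto": "root", "van": "auto", "coupé": "auto"}

def sc(c, parent):
    def anc(x):
        out = set()
        while x in parent:
            x = parent[x]
            out.add(x)
        return out
    return {c} | anc(c) | {x for x in parent if c in anc(x)}

def local_to(c):
    a, b = sc(c, O_R), sc(c, O_C)
    return len(a & b) / len(a | b)

shared = {"root", "bike", "van", "coupé"}        # concepts present in both hierarchies
print(round(sum(local_to(c) for c in shared) / len(shared), 2))   # ~0.56
```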

References & Further Reading
Dellschaft, K., Staab, S. (2006): On How to Perform a Gold Standard Based Evaluation of Ontology Learning. In: I. Cruz et al. (eds.) ISWC 2006, LNCS 4273, pp. 228-241. Springer, Heidelberg.
Maedche, A., Staab, S. (2002): Measuring Similarity between Ontologies. In: Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW 2002), Siguenza, Spain.
Staab, S., Hotho, A.: Semantic Web and Machine Learning Tutorial (available at http://www.uni-koblenz.de/~staab/research/events/icml05tutorial/icml05tutorial.pdf).
Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) (2011): Schema Matching and Mapping. Springer-Verlag, Heidelberg. ISBN 978-3-642-16517-7.

End of tutorial! Thanks for your attention! Michalis Sfakakis