Similarity for Conceptual Querying

Troels Andreasen, Henrik Bulskov, and Rasmus Knappe
Department of Computer Science, Roskilde University, P.O. Box 260, DK-4000 Roskilde, Denmark
{troels,bulskov,knappe}@ruc.dk

Abstract. The focus of this paper is approaches to measuring similarity for application in content-based query evaluation. Rather than only comparing at the level of words, the issue here is conceptual resemblance. The basis is a knowledge base defining major concepts of the domain, which may include taxonomic and ontological domain knowledge. The challenge for supporting queries in this context is an evaluation principle that on the one hand respects the formation rules for concepts in the concept language and on the other is sufficiently efficient to be a realistic candidate for query evaluation. We present and discuss principles where efficiency is obtained by reducing the matching problem (which basically is a matter of conceptual reasoning) to numerical similarity computation.

1 Introduction

The goal of concept-based querying in text retrieval systems is to include semantics, looking for improvements that can enhance the system's ability to generate ideal answers. In this context, one of the major problems is to determine the similarity between the semantic elements. It is no longer only a simple match of keywords in the text objects, but also their meaning, that we have to take into consideration when we calculate the similarity between queries and objects in our database.

The foundation of this paper is our previous work [1] and our affiliation with the interdisciplinary research project OntoQuery¹ (Ontology-based Querying) [3, 4]. We assume that text objects are described by compound concepts in a description language, that queries are expressed in this language or can be transformed into it, and that these descriptions refer to a knowledge base clarifying the domain of the database. The environment for this type of querying may be a system that can automatically produce conceptual descriptions (conceptual indexing) of text objects and support textual/word-list queries by initial transformation into descriptions.

¹ The project has the following participating institutions: Centre for Language Technology, The Technical University of Denmark, Copenhagen Business School, Roskilde University and the University of Southern Denmark.

2 Concepts and concept language

The first thing is therefore to describe the basis of this environment: what concepts are in this context, and the expressiveness of the languages used. The concept language Ontolog [5] is based on a set of atomic concepts which can be combined with semantic relations to form compound concepts. Expressions in Ontolog are descriptions of concepts situated in an ontology formed by an algebraic lattice with concept inclusion as the ordering relation.

Attribution of concepts, i.e. combining atomic concepts into compound concepts, can be written as feature structures. Simple attribution of a concept c_1 with relation r and a concept c_2 is denoted c_1[r : c_2]. We assume a set of atomic concepts A and a set of semantic relations R. The set of well-formed terms L of the Ontolog language is then recursively defined as follows:

  if x ∈ A then x ∈ L
  if x ∈ L, r_i ∈ R and y_i ∈ L, i = 1, ..., n, then x[r_1 : y_1, ..., r_n : y_n] ∈ L

Compound terms can thus be built by nesting, for instance c_1[r_1 : c_2[r_2 : c_3]], and by multiple attribution, as in c_1[r_1 : c_2, r_2 : c_3]. The attributes of a multiply attributed term T = x[r_1 : y_1, ..., r_n : y_n] are considered as a set, thus we can rewrite T with any permutation of r_1 : y_1, ..., r_n : y_n.

The basis for the ontology is a simple taxonomic concept inclusion relation isa_KB, which is atomic in the sense that it defines the hyponymy lattice over the set of atomic concepts A. It is considered as domain or world knowledge and may for instance express the view of a domain expert. Based on isa, the transitive closure of isa_KB, we can generalize to a relation ≤ over all well-formed terms of the language L by the following:

  if x isa y then x ≤ y
  if x[...] ≤ y[...] then also x[..., r : z] ≤ y[...] and x[..., r : z] ≤ y[..., r : z]
  if x ≤ y then also z[..., r : x] ≤ z[..., r : y]

where the repeated ... in each inequality denotes identical lists of zero or more attributes of the form r_i : w_i.

The purpose of the language introduced above is to describe fragments of meaning in text more thoroughly than can be obtained from simple keywords, while still refraining from full meaning representations, which are obviously not realistic in general search applications (with a huge database). Take as an example the noun phrase "the dark grey cat", which can be translated into the semantic expression cat[CHR: grey[CHR: dark]]. Descriptions of text expressed in this language go beyond simple keyword descriptions, partly due to the formation of compound terms and partly due to the reference to the ontology. A key question in the framework of querying is of course the definition of similarity or nearness of terms, now that we can no longer rely on simple matching of keywords.
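The recursive definition above maps directly onto a small recursive data structure. The following Python sketch is not from the paper; it is a minimal illustration of one possible encoding of Ontolog terms, assuming atomic concepts and relation names are plain strings and treating the attribute list as a set, as required above. Later sketches in this section reuse this hypothetical Term class.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Term:
    """A well-formed Ontolog term: an atomic concept plus zero or more
    attributes (relation, value), where each value is itself a Term."""
    atom: str                                   # an element of A
    attrs: Tuple[Tuple[str, "Term"], ...] = ()  # pairs (r_i, y_i)

    def __post_init__(self):
        # The paper treats the attribute list as a set, so normalize the
        # order to make permutations of the same attributes compare equal.
        object.__setattr__(self, "attrs", tuple(sorted(self.attrs, key=str)))

    def __str__(self):
        if not self.attrs:
            return self.atom
        return f"{self.atom}[" + ", ".join(f"{r}: {y}" for r, y in self.attrs) + "]"

# "the dark grey cat"  ->  cat[CHR: grey[CHR: dark]]
dark_grey_cat = Term("cat", (("CHR", Term("grey", (("CHR", Term("dark")),))),))
print(dark_grey_cat)                                    # cat[CHR: grey[CHR: dark]]

# Multiple attribution: attribute order is irrelevant.
a = Term("c1", (("r1", Term("c2")), ("r2", Term("c3"))))
b = Term("c1", (("r2", Term("c3")), ("r1", Term("c2"))))
print(a == b)                                           # True
```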

3 From Ontology to Similarity

In building a query evaluation principle that draws on an ontology, a key issue is of course how the different relations of the ontology may contribute to similarity. We have to decide for each relation to what extent related values are similar, and we must build similarity functions, mapping values into similarities, that reflect these decisions. We first discuss below how to introduce similarity based on the key ordering relation in the ontology, isa, as applied to atomic concepts (concepts explicitly represented in the ontology). Secondly we discuss how to extend the notion of similarity to cover not only atomic but general compound concepts, expressed in the language Ontolog. Finally we introduce a refinement that uses all possible relations in the calculation of similarity.

3.1 Similarity on atomic concepts

As discussed in [1], the hyponymy relation implies a strong similarity in the opposite direction of the inclusion (specialization), but the direction of the inclusion (generalization) must also contribute with some degree of similarity. Take as an example the small fraction of an ontology shown in figure 1a. With reference to this ontology, the atomic concept dog can be directly expanded to cover also poodle and alsatian. The intuition is that to a query on dog, an answer including instances of poodle is satisfactory (a specific answer to a general query). Since the hyponymy relation obviously is transitive, we can by the same argument expand to further specializations, e.g. to include poodle in the extension of animal. However, similarity exploiting the lattice should also reflect distance in the relation. Intuitively, greater distance (a longer path in the relation graph) corresponds to smaller similarity.

Furthermore, generalization should also contribute to similarity. Of course it is not strictly correct in an ontological sense to expand the extension of dog with instances of animal, but because all dogs are animals, animals are to some degree similar to dogs. This substantiates that a property of generalization similarity should also be exploited and, for similar reasons as in the case of specializations, that transitive generalizations should also contribute with a decreasing degree of similarity.

To make distance influence similarity we need either to be able to distinguish explicitly stated, original references from derived ones, or to establish a transitive reduction isa_REDUC of the isa relation. Similarity reflecting distance can then be measured from path length in the isa_REDUC lattice. A similarity function sim based on distance dist(x, y) in isa_REDUC should have the properties:

1. sim: U × U → [0, 1], where U is the universe of concepts
2. sim(x, y) = 1 only if x = y
3. sim(x, y) > sim(x, z) if dist(x, y) < dist(x, z)

Fig. 1. (a) Inclusion relation (ISA) with upwards reading, e.g. dog ISA animal; (b) the ontology transformed into a directed weighted graph, with the immediate specialization and generalization similarities δ = 0.9 and γ = 0.4 respectively as weights. Similarity is derived as the maximal (multiplicative) weighted path length, thus sim(poodle, alsatian) = 0.4 · 0.9 = 0.36.

By parameterizing with two factors δ and γ, expressing the similarity of immediate specialization and generalization respectively, we can define a simple similarity function. If there is a path between nodes (concepts) x and y in the hyponymy relation then it has the form

  P = (p_1, ..., p_n), where p_i isa_REDUC p_{i+1} or p_{i+1} isa_REDUC p_i for each i,    (1)

with x = p_1 and y = p_n. Given a path P = (p_1, ..., p_n), let s(P) and g(P) be the numbers of specializations and generalizations respectively along the path P, thus:

  s(P) = |{ i | p_{i+1} isa_REDUC p_i }| and g(P) = |{ i | p_i isa_REDUC p_{i+1} }|    (2)

If P_1, ..., P_m are all paths connecting x and y, then the degree to which y is similar to x can be defined as

  sim(x, y) = max_{j=1,...,m} { δ^s(P_j) · γ^g(P_j) }    (3)

which thus corresponds to the maximally weighted (in this sense shortest) path between concepts x and y in the hyponymy lattice. This similarity can be considered as derived from the ontology by transforming the ontology into a directed weighted graph, with δ as downwards and γ as upwards weights, and with similarity derived as the product of the weights along the paths.
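As a concrete illustration of (3), the following sketch (not from the paper; the graph encoding and helper names are assumptions) stores the transitive reduction of Figure 1 as a directed weighted graph and computes sim as the maximal multiplicative path weight. With δ = 0.9 and γ = 0.4 it reproduces the values used in the expansions below, e.g. sim(poodle, alsatian) = 0.36.

```python
# Transitive reduction isa_REDUC of Fig. 1, as (child, parent) pairs.
ISA_REDUC = [
    ("animal", "anything"),
    ("bird", "animal"), ("cat", "animal"), ("dog", "animal"),
    ("poodle", "dog"), ("alsatian", "dog"),
]

DELTA, GAMMA = 0.9, 0.4   # immediate specialization / generalization weights

# Directed weighted graph: downward edges (parent -> child) weigh DELTA,
# upward edges (child -> parent) weigh GAMMA.
edges = {}
for child, parent in ISA_REDUC:
    edges.setdefault(parent, {})[child] = DELTA
    edges.setdefault(child, {})[parent] = GAMMA

def sim(x, y):
    """Equation (3): maximal multiplicative path weight from x to y."""
    best = {x: 1.0}
    frontier = [x]
    while frontier:                      # label-correcting search; all weights
        node = frontier.pop()            # are < 1, so products only shrink and
        for nxt, w in edges.get(node, {}).items():   # the loop terminates
            cand = best[node] * w
            if cand > best.get(nxt, 0.0):
                best[nxt] = cand
                frontier.append(nxt)
    return best.get(y, 0.0)

print(round(sim("poodle", "alsatian"), 3))   # 0.36  = gamma * delta
print(round(sim("dog", "poodle"), 3))        # 0.9   = delta
print(round(sim("poodle", "cat"), 3))        # 0.144 = gamma * gamma * delta
```

The expansion of an atomic concept into a fuzzy set of similar values, as in (4) below, then amounts to collecting sim(x, y) for every reachable y, possibly cut off at a threshold.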

An atomic concept x can then be expanded to a fuzzy set including x and similar values x_1, x_2, ..., x_n, as in:

  x+ = 1/x + sim(x, x_1)/x_1 + sim(x, x_2)/x_2 + ... + sim(x, x_n)/x_n    (4)

Thus for instance, with δ = 0.9 and γ = 0.4, the expansion of the concepts dog, animal and poodle into sets of similar values would be:

  dog+ = 1/dog + 0.9/poodle + 0.9/alsatian + 0.4/animal
  poodle+ = 1/poodle + 0.4/dog + 0.36/alsatian + 0.16/animal + 0.144/cat
  animal+ = 1/animal + 0.9/cat + 0.9/dog + 0.81/poodle + 0.81/alsatian

3.2 General concept-similarity

The semantic relations used in forming concepts in the ontology indirectly contribute to similarity through subsumption. For instance, cat[CHR: grey[CHR: dark]] is subsumed by, and thus extensionally included in, each of the more general concepts cat[CHR: grey[CHR: dark]], cat[CHR: grey] and cat. Thus, with a definition of similarity covering atomic concepts, and in some sense reflecting the ordering concept inclusion relation, we can extend to similarity on compound concepts by a relaxation which takes subsumed concepts into account when comparing descriptions. The principle can be considered a matter of subsumption expansion of a concept c, denoted ɛ(c). Any compound concept is expanded (or relaxed) into the set of subsuming concepts; thus ɛ(cat[CHR: grey[CHR: dark]]) is the set

  {cat[CHR: grey[CHR: dark]], cat[CHR: grey], cat}

corresponding to the fuzzy set

  {1/cat[CHR: grey[CHR: dark]] + 1/cat[CHR: grey] + 1/cat}

The similarity between compound concepts x and y can then be expressed as the degree to which the intersection of ɛ(x) and ɛ(y) is included in ɛ(x). One approach to query answering in this direction is to expand the description of the query along the ontology and the potential answer objects along subsumption. For instance, a query on dog could be expanded to a query on similar values like

  dog+ = {1/dog + ... + 0.4/animal + ...}

and a potential answer object like cat[CHR: grey[CHR: dark]] would then be subsumption expanded as exemplified above.

While not the key issue here, we should point out the importance of applying an appropriate averaging aggregation when comparing descriptions. It is essential that similarity based on subsumption expansion exploits, for instance, that the degree to which c_1[r_1 : c_2] matches c_1[r_1 : c_2[r_2 : c_3]] is higher than the degree to which c_1 with no attributes matches c_1[r_1 : c_2[r_2 : c_3]]. Approaches to aggregation that can be tailored to obtain these properties, based on ordered weighted averaging [6] and capturing nested structuring [7], are described in [2].
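To make subsumption expansion concrete, here is a small sketch (not from the paper; it reuses the hypothetical Term class from the earlier sketch) that generates ɛ(c) by keeping subsets of attributes and relaxing the kept values. The overlap score at the end is only a crude, unweighted stand-in used to illustrate why richer shared structure should score higher; the paper's actual proposal relies on the ordered weighted averaging aggregation of [6, 7] as described in [2].

```python
from itertools import combinations, product

def epsilon(term):
    """Subsumption expansion: the set of concepts subsuming `term`, obtained
    by keeping any subset of its attributes and relaxing each kept value."""
    subsumers = {Term(term.atom)}                    # the bare atom
    for k in range(1, len(term.attrs) + 1):
        for kept in combinations(term.attrs, k):
            relaxed = [[(r, s) for s in epsilon(y)] for r, y in kept]
            for choice in product(*relaxed):
                subsumers.add(Term(term.atom, tuple(choice)))
    return subsumers

dark_grey_cat = Term("cat", (("CHR", Term("grey", (("CHR", Term("dark")),))),))
print(sorted(map(str, epsilon(dark_grey_cat))))
# ['cat', 'cat[CHR: grey[CHR: dark]]', 'cat[CHR: grey]']

def overlap(x, y):
    """Crude Jaccard overlap of the two expansions (illustration only)."""
    ex, ey = epsilon(x), epsilon(y)
    return len(ex & ey) / len(ex | ey)

# A more specific description shares more structure with the object:
obj = Term("c1", (("r1", Term("c2", (("r2", Term("c3")),))),))
print(overlap(Term("c1", (("r1", Term("c2")),)), obj))   # 2/3
print(overlap(Term("c1"), obj))                          # 1/3
```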

3.3 Possible Paths and Shared Nodes

We have indicated above that the shortest path (3) between concepts in an ontology can be used as a measure of similarity. The intuition is that the similarity between cat[CHR: grey] and dog[CHR: grey] is larger than the similarity between either of these and bird[CHR: yellow], because both the cat and the dog are characterized by being grey, whereas the bird is yellow. But we are not able to capture this if the basis for the similarity measure is subsumption expansion of the objects in the database and expansion of the query by means of the ontology.

A possible solution to this problem is to introduce all relations into the calculation of similarity, thereby letting the attribution of concepts influence the similarity measure. This gives rise to at least the following considerations: firstly, we have to create the theoretical foundations for such a similarity measure, and secondly, the approach has to be scalable.

Figure 2 shows the introduction of compound concepts and semantic relations, visualized as dotted edges.

Fig. 2. An example ontology covering colored animals: the atoms anything, animal, color, bird, cat, dog, yellow and grey, and the compound concepts bird[CHR: yellow], dog[CHR: grey] and cat[CHR: grey].

The general idea now is a similarity measure between concepts c_1 and c_2 based upon the set of all nodes reachable from both concepts in the graph, representing the part of the ontology covering c_1 and c_2. These shared nodes reflect the similarity between concepts, both in terms of similar attribution and of subsuming concepts. The inclusion of both the hyponymy relation and the attribution of concepts in the calculation of similarity increases the complexity. We therefore devise a similarity measure that utilizes a well-defined subset of the ontology. To this end we first define the term-decomposition τ(c) and the upwards expansion ω(C) of a set of terms C. The term-decomposition τ(c) is the set of all subterms of c, which thus includes all concepts subsuming c and all attributes of subsuming concepts for c; formally it is defined as follows:

  τ(c) = {x | c ≤ x or c ≤ y[r : x], x ∈ L, y ∈ L, r ∈ R}

Consider as an example the decomposition of the term disorder[CBY: lack[WRT: vitaminC]]:

  τ(disorder[CBY: lack[WRT: vitaminC]]) = {disorder[CBY: lack[WRT: vitaminC]], disorder[CBY: lack], disorder, lack[WRT: vitaminC], lack, vitaminC}

The upwards expansion ω(C) of a set of terms C is the closure of C with respect to isa:

  ω(C) = {x | x ∈ C or ∃y ∈ C, y ISA x}

This expansion thus only adds atoms to C. We further define the upwards spanning subgraph (subontology) θ(C) for a set of concepts C = {c_1, ..., c_n} as the graph that appears when expanding the decomposition of C and connecting the resulting set of terms with edges corresponding to the isa relation and to the semantic relations used in the attribution of elements in C. We define the triple (x, y, r) as the edge of type r from concept x to concept y. The edges between nodes of ω(τ(C)) corresponding to the isa relation, α, are defined (with τ applied elementwise to C) as

  α = {(x, y, isa) | x, y ∈ ω(τ(C)), x ≤ y, and there is no z ∈ ω(τ(C)) with z ≠ x, z ≠ y and x ≤ z ≤ y}

and the edges corresponding to the semantic relations, β, are defined as

  β = {(x, y, r) | x, y ∈ ω(τ(C)), r ∈ R, x[r : y] ∈ τ(C)}

We can then define the subontology θ(C) as the union of α and β:

  θ(C) = α ∪ β

Figure 2 is an example of such a subontology spanned by three terms. We can now define the set of shared nodes σ for two concepts c_1 and c_2 as the intersection of the expansions of the decompositions of the two terms:

  σ(c_1, c_2) = ω(τ(c_1)) ∩ ω(τ(c_2))
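The constructions τ, ω and σ can likewise be sketched in a few lines. The code below is hypothetical: it reuses the Term class and epsilon function from the earlier sketches, encodes the Figure 2 taxonomy (including the colour atoms) as an assumed ISA table, and implements one possible reading of the definitions above.

```python
# ISA pairs (child, parent) for the Fig. 2 ontology (colour atoms included).
ISA_FIG2 = [
    ("animal", "anything"), ("color", "anything"),
    ("bird", "animal"), ("cat", "animal"), ("dog", "animal"),
    ("yellow", "color"), ("grey", "color"),
]
PARENTS = {}
for child, parent in ISA_FIG2:
    PARENTS.setdefault(child, set()).add(parent)

def ancestors(atom):
    """All atoms above `atom` in the ISA hierarchy (transitive closure)."""
    out, stack = set(), [atom]
    while stack:
        for p in PARENTS.get(stack.pop(), ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def tau(term):
    """Term decomposition: the subsumers of `term` plus, recursively,
    the decompositions of its attribute values."""
    parts = set(epsilon(term))
    for _, value in term.attrs:
        parts |= tau(value)
    return parts

def omega(terms):
    """Upward expansion: close a set of terms under ISA (adds atoms only)."""
    return set(terms) | {Term(a) for t in terms for a in ancestors(t.atom)}

def sigma(c1, c2):
    """Shared nodes: sigma(c1, c2) = omega(tau(c1)) & omega(tau(c2))."""
    return omega(tau(c1)) & omega(tau(c2))

grey_cat = Term("cat", (("CHR", Term("grey")),))
grey_dog = Term("dog", (("CHR", Term("grey")),))
yellow_bird = Term("bird", (("CHR", Term("yellow")),))
print(sorted(map(str, sigma(grey_cat, grey_dog))))
# ['animal', 'anything', 'color', 'grey']
print(sorted(map(str, sigma(grey_cat, yellow_bird))))
# ['animal', 'anything', 'color']
```

For these three descriptions the grey cat and the grey dog share four nodes (among them grey), while the grey cat and the yellow bird share only the three atoms above them in the taxonomy, matching the intuition stated at the beginning of this subsection.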

4 Conclusion

We have described different principles for measuring similarity between both atomic and compound concepts, all of which incorporate meta knowledge: 1) similarity between atomic concepts based on distance in the ordering relation of the ontology, concept inclusion (isa); 2) similarity between general compound concepts based on subsumption expansion; and 3) similarity between both atomic and general compound concepts based on shared nodes.

The notion of measuring similarity as distance, either in the ordering relation alone or in combination with the semantic relations, seems to indicate a usable theoretical foundation for the design of similarity measures. The inclusion of the attribution of concepts, by means of shared nodes, in the calculation of similarity gives a possible approach to a measure that captures more details and at the same time scales to large systems.

The purpose of similarity measures in connection with querying is of course to look for similar rather than for exactly matching values, that is, to introduce soft rather than crisp evaluation. As indicated through the examples above, one approach to introducing similar values is to expand crisp values into fuzzy sets that also include similar values. Expansion of this kind, applying similarity based on knowledge in the knowledge base, is a simplification replacing direct reasoning over the knowledge base during query evaluation. The graded similarity is the obvious means to make expansion useful: by using simple threshold values for similarity, the size of the answer can be fully controlled.

References

[1] Bulskov, H., Knappe, R. and Andreasen, T.: On Measuring Similarity for Conceptual Querying. In T. Andreasen, A. Motro, H. Christiansen, H.L. Larsen (eds.): Flexible Query Answering Systems, 5th International Conference, FQAS 2002, Copenhagen, Denmark, October 27-29, 2002, Proceedings, LNAI 2522, pp. 100-111.
[2] Andreasen, T.: Query evaluation based on domain-specific ontologies. In NAFIPS 2001, 20th IFSA/NAFIPS International Conference on Fuzziness and Soft Computing, pp. 1844-1849, Vancouver, Canada, 2001.
[3] Andreasen, T., Nilsson, J. Fischer and Thomsen, H. Erdman: Ontology-based Querying. In Larsen, H.L. et al. (eds.): Flexible Query Answering Systems, Recent Advances, Physica-Verlag, Springer, 2000, pp. 15-26.
[4] Andreasen, T., Jensen, P. Anker, Nilsson, J. Fischer, Paggio, P., Pedersen, B. Sandford and Thomsen, H. Erdman: Ontological Extraction of Content for Text Querying. To appear in NLDB 2002, Stockholm, Sweden, 2002.
[5] Nilsson, J. Fischer: A Logico-algebraic Framework for Ontologies, ONTOLOG. In Jensen, P. Anker and Skadhauge, P. (eds.): Proceedings of the First International OntoQuery Workshop, Ontology-based Interpretation of NPs. Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001.
[6] Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, vol. 18, 1988.
[7] Yager, R.R.: A hierarchical document retrieval language. Information Retrieval, vol. 3, issue 4, Kluwer Academic Publishers, pp. 357-377, 2000.