Relational Representation of ALN Knowledge Bases

Thomas Studer

December 7, 2004

Abstract

The retrieval problem for a knowledge base O and a concept C is to find all individuals a such that O entails C(a). We describe a method to represent a knowledge base O as an instance Ô of a relational database. To answer a retrieval query over O, we can execute a corresponding database query over Ô. The main problem of such an embedding is that the semantics of knowledge bases is characterized by an open world assumption, while traditional databases feature a closed world semantics. Our procedure is sound and complete for knowledge bases given in the description logic ALN. We present an application of this relational representation technique in the area of online analytical processing.

1 Introduction

One of the main reasoning tasks for knowledge base systems is the retrieval problem [5, 26]: given a knowledge base O and a concept description C, find all individuals a such that O entails C(a). That is, given O and C, we are looking for all individuals a for which we can logically infer from O that a belongs to C. There is a trivial algorithm for this problem, namely to test for each individual occurring in O whether it is an instance of the concept C. Of course, such an algorithm is in general not feasible. Thus, special storage and reasoning techniques are needed in order to answer retrieval queries efficiently.

Description Logics are expressive languages for knowledge representation [3]. They deal with unary predicates (concepts), representing collections of individuals, and binary predicates (so-called roles). A description logic knowledge base consists of a terminology, stating relationships between concepts,

and of a world description, making assertions about individuals. There are a number of systems available which support reasoning and persistent storage of such knowledge bases, for instance Jena [19], Sesame [8], DLDB [22] and knowler [10]. Often, such systems are implemented on a common RDBMS and a description logic reasoner like FaCT [18] or Racer [17].

We show that for a knowledge base O that is given in the description logic language ALN, it is possible to compute an instance Ô of a relational database which represents O with respect to the retrieval problem. That is, for every concept description C, we can run a database query on the instance Ô which returns exactly the individuals belonging to C in O. Hence, we do not need a special description logic reasoning engine to answer retrieval queries. Reasoning only takes place in the initial computation of the database instance. This instance is built by performing a kind of completion algorithm. The completion algorithm calculates the extensions of the base concepts which occur in the knowledge base. That means, for each base concept A, we determine the set of all individuals belonging to A. Thus, we can represent the extensions of the base concepts and roles of the knowledge base as an instance of a relational database. The retrieval query for a complex concept description C can now be answered by a simple database query on this instance.

The main technical difficulty in representing a description logic knowledge base as a relational database instance is that description logics are characterized by an open world assumption whereas databases work with a closed world semantics [5]. As a consequence, absence of information in a database is interpreted as negative information, while absence of information in description logics only means lack of knowledge. For instance, consider a knowledge base with a hasChild role that contains only one tuple, say (Peter, Harry). Still, we are not allowed to infer $\leq 1.\mathit{hasChild}(\mathit{Peter})$ from this knowledge base because there are models where the interpretation of hasChild contains more tuples. We cannot identify negative information of our knowledge base with missing information in its relational representation; instead, we have to introduce special relations and terms in order to deal with negative information in the database instance.

Vitória et al. [26] present an approach to the retrieval problem for ALN which is based on finite automata. Similar to our work, they perform a preprocessing step which produces a completion of the knowledge base. However, they build a finite automaton instead of constructing a database instance. This automaton can then be used to answer retrieval queries in a syntax-directed way.

Another important line of research in the context of retrieval queries is the study of efficient storage mechanisms. Pan and Heflin [22] provide a good overview of the different strategies to store RDF(S) data

in relational databases. The main techniques are the vertical representation [1] (employed by Jena [19]), the decomposition storage model [12] (used in the PARKA knowledge representation system [25]), and the hybrid approach (taken by DLDB [22]).

Although ALN is not a very expressive description logic language, it still suffices for many applications. As Goasdoué and Rousset [16] notice, ontology designers often do not even exploit the full expressive power of ALN to model their application domain. In the context of web ontology languages, Antoniou and van Harmelen [2] also state that there are reasons to expect that most ontological knowledge will be of a rather simple nature.

We are going to present an application of our relational representation of knowledge bases in the context of online analytical processing (OLAP) [11, 20]. An OLAP database is built around facts which are analyzed in terms of a dimensional model where each dimension is structured by (multiple) hierarchies. For certain application domains, it may be appropriate to model this structural information as a description logic knowledge base which can then be represented as a database instance. Thus, it becomes possible to employ the definable concepts of the description logic to group the facts of the OLAP database, thereby providing more expressive queries. Decidability issues of combining description logics and database query languages have been studied for example in [15, 21].

The rest of the paper is organized as follows. The next section introduces ALN knowledge bases. Section 3 presents the algorithm to construct the database representation of a given knowledge base. Soundness and completeness of this construction are shown in Section 4. Section 5 discusses an application of the relational representation in the context of OLAP. Section 6 concludes the paper.

2 Preliminaries

In this section, we first present syntax and semantics of the description logic ALN. In a second part, we introduce ALN knowledge bases which consist of a terminology and a world description.

The syntax of ALN is given by the following rule, where A, B are used for

atomic concepts, R denotes a role and C, D stand for concept descriptions:

$C, D \rightarrow A$ (atomic concept)
$\mid \top \mid \bot$ (top, bottom)
$\mid \neg A$ (atomic negation)
$\mid C \sqcap D$ (conjunction)
$\mid \forall R.C$ (value restriction)
$\mid \geq n.R \mid \leq n.R$ (number restrictions)

The semantics of concept descriptions is given as follows. An interpretation $I$ is a pair $(\Delta^I, \cdot^I)$ where $\Delta^I$ is a non-empty set of individuals called the domain of the interpretation and $\cdot^I$ is an interpretation function assigning to each atomic concept $A$ a set $A^I \subseteq \Delta^I$ and to each role $R$ a binary relation $R^I \subseteq \Delta^I \times \Delta^I$. The interpretation function is extended to concept descriptions by the following inductive definition:

$\top^I := \Delta^I$, $\bot^I := \emptyset$
$(\neg A)^I := \Delta^I \setminus A^I$
$(C \sqcap D)^I := C^I \cap D^I$
$(\forall R.C)^I := \{a \in \Delta^I \mid \forall b\,((a, b) \in R^I \rightarrow b \in C^I)\}$
$(\geq n.R)^I := \{a \in \Delta^I \mid |\{b \mid (a, b) \in R^I\}| \geq n\}$
$(\leq n.R)^I := \{a \in \Delta^I \mid |\{b \mid (a, b) \in R^I\}| \leq n\}$

where $|S|$ denotes the cardinality of the set $S$.

A knowledge base $O$ consists of a terminology $O_T$ (called TBox) and a world description $O_A$ (called ABox). In this paper we will only consider finite knowledge bases.

The terminology $O_T$ contains concept definitions and restricted inclusion statements. A concept definition is an expression of the form $A := C$ where $A$ is an atomic concept and $C$ is a concept description. We assume that an atomic concept appears on the left-hand side of at most one concept definition. That is, the set of definitions is unequivocal. We divide the atomic concepts occurring in $O_T$ into two sets, the name symbols that occur on the left-hand side of some concept definition and the base symbols that occur only on the right-hand side of the concept definitions. Let $A$ and $B$ be atomic concepts. We say $A$ depends on $B$ if $B$ appears in the concept definition of $A$. A set of concept definitions is called acyclic if there is no cycle in this dependency relation. For the rest of the paper we will consider acyclic TBoxes only.

An acyclic set of definitions can be unfolded by the following iterative process: each occurrence of a name symbol on the right-hand side of a definition is

replaced by the concept that it stands for. Since there is no cycle in the set of definitions, this process eventually stops. We end up with a terminology consisting solely of definitions of the form $A := C$ where $C$ contains only base symbols and no name symbols. Since we work in ALN, we need an additional constraint to make this unfolding possible. We assume that no name symbol occurs negatively in $O$. Otherwise we could have a terminology consisting of $A := \neg B$ and $B := C \sqcap D$. This would result in $A := \neg(C \sqcap D)$, which is not an ALN concept description.

The allowed inclusion statements are of the form $A \sqsubseteq B$ and $A \sqsubseteq \neg B$ where $A$ and $B$ are base symbols, that is, atomic concepts for which no definition is included in the TBox. The inclusion statements of the TBox have to be acyclic: there is no chain $L_0, L_1, L_2, \ldots, L_n$ such that $L_0$ is the same concept as $L_n$ and all statements $L_i \sqsubseteq L_{i+1}$ are included in the TBox.

An interpretation $I$ satisfies $O_T$ if for every inclusion $C \sqsubseteq D$ in $O_T$ we have $C^I \subseteq D^I$ and if for every definition $A := C$ in $O_T$ we have $A^I = C^I$.

A world description $O_A$ is a set of assertions about the extension of concepts and roles. In the ABox we introduce a set of individual constants $a, b, c, \ldots$ and we assert properties of these individuals. We can make the following two kinds of assertions in an ABox: $C(a)$ and $R(b, c)$ where $C$ is a concept and $R$ is a role. The first kind, called concept assertion, states that $a$ belongs to (the interpretation of) $C$. The second kind, called role assertion, states that $b$ and $c$ are related by $R$.

We give semantics for the ABox by extending interpretations to individual constants as follows. Let $I$ be $(\Delta^I, \cdot^I)$ where the interpretation function $\cdot^I$ not only maps atomic concepts and roles to sets and relations, but in addition maps each individual constant $a$ to an element $a^I$ of $\Delta^I$. We assume that distinct individual constants denote distinct objects in $\Delta^I$. Therefore, the mapping $\cdot^I$ respects the unique name assumption, that is, if $a$ and $b$ are different individual constants, then we have $a^I \neq b^I$. The interpretation $I$ satisfies the concept assertion $C(a)$ if $a^I \in C^I$ and it satisfies the role assertion $R(b, c)$ if $(b^I, c^I) \in R^I$. An interpretation satisfies an ABox $O_A$ if it satisfies each assertion in it. We call an interpretation $I$ a model of a knowledge base $O$ if $I$ satisfies $O_T$ as well as $O_A$.

We will use the following notation. $I \models O$ means that $I$ is a model of the knowledge base $O$. We call a knowledge base $O$ satisfiable if there exists an interpretation $I$ such that $I \models O$. We say $O$ entails $C(a)$, formally $O \models C(a)$, if every model $I$ of $O$ satisfies $C(a)$. Similarly, $O \models R(a, b)$ states that every model $I$ of $O$ satisfies $R(a, b)$. By $O \not\models C(a)$ and $O \not\models R(a, b)$ we mean that there exists a model $I$ of $O$ which does not satisfy $C(a)$ and $R(a, b)$, respectively.

Example 1. Let us have a look at two examples of entailment.

- Let $O$ be a knowledge base containing the assertions $\leq 1.R(a)$, $R(a, b)$ and $C(b)$. We find that $O \models \forall R.C(a)$.
- Let $O$ be a knowledge base containing $\forall R.A(a)$, $\forall R.\neg B(a)$ and having $A \sqsubseteq B$ in its TBox. We find that $O \models \leq 0.R(a)$.

3 Building the database instance

We are going to present the construction of the database instance representing an ALN knowledge base. We will consider normalized knowledge bases. The normalization of a knowledge base $O$ is achieved by the following steps:

1. Add the inclusion statement $\neg B \sqsubseteq \neg A$ for each $A \sqsubseteq B$ in $O$.
2. Replace each statement $A \sqsubseteq \neg B$ by the two inclusion statements $A \sqsubseteq \neg B$ and $B \sqsubseteq \neg A$.
3. If there is a chain $L_0, L_1, L_2, \ldots, L_n$ of concepts such that $L_n$ is the negation of $L_0$ and all statements $L_i \sqsubseteq L_{i+1}$ are included in $O$, then add $L_0 \sqsubseteq \bot$.
4. If there is a concept $L$ such that $L \sqsubseteq \bot$ is an inclusion statement of $O$, then replace $L$ with $\bot$ in $O$.
5. Fully unfold the concept definitions of $O$, so that no name symbol occurs on the right-hand side of a concept definition. Then exhaustively apply the normalization rule $\forall R.(C \sqcap D) \rightarrow \forall R.C \sqcap \forall R.D$. Now replace each assertion $C(a) \in O$ by $C'(a)$ where $C'$ is built from $C$ by substituting each name symbol with its (unfolded and normalized) definition.

Observe that in a normalized knowledge base, all atomic concepts are base symbols. If an atomic concept has a definition, then it has been replaced by its definition during the process of normalization. Hence, the assertional part of a normalized knowledge base contains only concept descriptions built from base symbols. In the sequel, $O$ will always denote such a normalized knowledge base.

Given a knowledge base $O$, let $\mathcal{C}$ be the set of all individuals occurring in $O$. Now we define a set $\hat{\mathcal{C}}$ of individual constants which will be the domain of our database instance. We need some auxiliary notions. The role depth

$rd(C)$ of a concept $C$ is given by:

$rd(\top) = rd(\bot) = rd(A) = rd(\neg A) = 0$,
$rd(C \sqcap D) = \max(rd(C), rd(D))$,
$rd(\forall R.C) = rd(C) + 1$ and
$rd(\geq n.R) = rd(\leq n.R) = 1$.

Let $rd(O)$ be the maximum of all $rd(C)$ for concept expressions $C$ occurring in $O$. Let $k$ be the biggest natural number $n$ such that a number restriction $\geq n.R$ or $\leq n.R$ occurs in $O$. The set $\hat{\mathcal{C}}$ is given by:

- every individual constant $a \in \mathcal{C}$ is included in $\hat{\mathcal{C}}$. We define $rd(a) := 0$ for $a \in \mathcal{C}$.
- new constants $\ast_{R,a,i}$ are added to $\hat{\mathcal{C}}$ for all role names $R$, for all $i$ with $1 \leq i \leq k$ and for all $a \in \hat{\mathcal{C}}$ with $rd(a) < rd(O)$. We define $rd(\ast_{R,a,i}) := rd(a) + 1$.
- new constants $\ast_{R,a}$ are added to $\hat{\mathcal{C}}$ for all role names $R$ and for all $a \in \hat{\mathcal{C}}$ with $rd(a) < rd(O)$. We define $rd(\ast_{R,a}) := rd(a) + 1$.

Note that since $O$ is finite, $\mathcal{C}$ is finite. Thus, $\hat{\mathcal{C}}$ is also finite because of the role depth restrictions in its definition.

Next, we are going to define the relation $O \vdash_\alpha s$ where $s$ is an ABox assertion and $\alpha$ is a natural number. The intended meaning of $O \vdash_\alpha s$ is "$s$ follows from $O$ in at most $\alpha$ steps". We will employ this relation to compute the atomic concept assertions which are entailed by $O$. They will serve as basis to build the database instance we are aiming at.

1. $O \vdash_0 C(a)$ if $C(a) \in O$.
2. $O \vdash_0 \top(a)$, for all $a \in \hat{\mathcal{C}}$.
3. $O \vdash_0 \neg A(a)$, for all $a \in \hat{\mathcal{C}}$, if $A \sqsubseteq \bot \in O$.
4. $O \vdash_0 R(a, b)$ if $R(a, b) \in O$.
5. $O \vdash_0 R(a, \ast_{R,a})$ if $\ast_{R,a} \in \hat{\mathcal{C}}$.
6. $O \vdash_{\alpha+1} C(a)$ if $O \vdash_\alpha C \sqcap D(a)$ or $O \vdash_\alpha D \sqcap C(a)$.
7. $O \vdash_{\alpha+1} C(a)$ if there exists a constant $b \in \hat{\mathcal{C}}$ such that $O \vdash_\alpha \forall R.C(b)$ and $O \vdash_\alpha R(b, a)$.
8. $O \vdash_{\alpha+1} C(a)$ if $O \vdash_\alpha D(a)$ and $D \sqsubseteq C \in O$.
9. $O \vdash_{\alpha+1} R(a, \ast_{R,a,i})$ if the following two conditions hold: $O \vdash_\alpha \geq n.R(a)$ and $1 \leq i \leq n - |\{x \in \mathcal{C} \mid R(a, x) \in O\}|$.
10. $O \vdash_{\alpha+1} \forall R.C(a)$ if $O \vdash_\alpha \forall R.D(a)$ as well as $D \sqsubseteq C \in O$.
11. $O \vdash_{\alpha+1} \leq 0.R(a)$ if we have $O \vdash_\alpha \forall R.A(a)$ and $O \vdash_\alpha \forall R.\neg A(a)$ for some base concept $A$.
12. $O \vdash_{\alpha+1} \leq 0.R(a)$ if we have $O \vdash_\alpha \forall R.\leq n.R_2(a)$ and $O \vdash_\alpha \forall R.\geq m.R_2(a)$ for some role $R_2$ and natural numbers $n$ and $m$ with $n < m$.
13. $O \vdash_{\alpha+1} \leq 0.R(a)$ if $O \vdash_\alpha \forall R.\bot(a)$.
14. $O \vdash_{\alpha+1} C(a)$ if $O \vdash_\alpha C(a)$.
15. $O \vdash_{\alpha+1} R(a, b)$ if $O \vdash_\alpha R(a, b)$.

We will write $O \vdash C(a)$ and $O \vdash R(a, b)$ for $\exists\alpha.\,O \vdash_\alpha C(a)$ and $\exists\alpha.\,O \vdash_\alpha R(a, b)$, respectively.

Let us briefly discuss the different cases in the definition of $O \vdash_\alpha s$. Cases 1 to 4 deal with assertions directly included in $O$. In case 5 we add the special constants $\ast_{R,a}$ to the extension of a role $R$. If we later have $O \vdash \forall R.C(a)$, then this implies $O \vdash C(\ast_{R,a})$ (see case 7). At the end, we get that if $\ast_{R,a}$ is in the answer to a query about the concept $C$ in our database instance, then $O$ entails $\forall R.C(a)$.

Cases 6 to 8 compute the immediate consequences of assertions of the form $C \sqcap D(a)$, $\forall R.C(a)$ and $C \sqsubseteq D$.

Case 9 treats the consequences of $\geq n.R(a)$. Such an assertion demands the existence of a certain number of elements of the form $(a, x)$ in $R$. However, it does not impose any condition on what $x$ has to be. So we can simply include new individual constants $\ast_{R,a,i}$ in order to witness these existential commitments. We only have to take care that we do not include too many new elements, since this could contradict a statement of the form $\leq n.R(a) \in O$.

Cases 10 to 13 deal with the relationship between statements of the form $\leq 0.R(a)$ and the constructor for value restriction $\forall R$ (see Example 1).

Cases 14 and 15 finally are included to make the definition monotone with respect to $\alpha$.
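The construction of the extended domain from the role depth rd(O) and the bound k, as defined before the derivation rules, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tuple encoding of concepts and the tags "witness" (for the constants indexed by R, a, i) and "generic" (for the constants indexed by R, a) are hypothetical names chosen for this sketch and do not come from the paper.

```python
# Concepts are encoded as nested tuples (an encoding assumed for this sketch):
#   ("top",), ("bot",), ("atom", "A"), ("not", "A"), ("and", C, D),
#   ("all", "R", C), ("atleast", n, "R"), ("atmost", n, "R")

def role_depth(concept):
    """Role depth rd(C) of a concept description."""
    tag = concept[0]
    if tag in ("top", "bot", "atom", "not"):
        return 0
    if tag == "and":
        return max(role_depth(concept[1]), role_depth(concept[2]))
    if tag == "all":
        return role_depth(concept[2]) + 1
    if tag in ("atleast", "atmost"):
        return 1
    raise ValueError(f"unknown constructor: {tag}")

def max_number(concept):
    """Largest n in any number restriction inside the concept (0 if none)."""
    tag = concept[0]
    if tag in ("atleast", "atmost"):
        return concept[1]
    if tag == "and":
        return max(max_number(concept[1]), max_number(concept[2]))
    if tag == "all":
        return max_number(concept[2])
    return 0

def build_domain(individuals, roles, concepts):
    """Map every constant of the extended domain to its depth rd(.)."""
    rd_o = max((role_depth(c) for c in concepts), default=0)
    k = max((max_number(c) for c in concepts), default=0)
    depth = {a: 0 for a in individuals}
    frontier = list(individuals)
    # New constants are only attached to constants of depth < rd(O),
    # so the construction stops after rd(O) rounds.
    for level in range(1, rd_o + 1):
        fresh = []
        for a in frontier:
            for r in roles:
                fresh.extend(("witness", r, a, i) for i in range(1, k + 1))
                fresh.append(("generic", r, a))
        for c in fresh:
            depth[c] = level
        frontier = fresh
    return depth

if __name__ == "__main__":
    # Knowledge base with the assertions (<= 1.R)(a), R(a, a) and A(a).
    domain = build_domain(["a"], ["R"], [("atmost", 1, "R"), ("atom", "A")])
    print(sorted(domain.items(), key=lambda item: item[1]))
```

For a knowledge base with the single individual a, the single role R and the concepts ≤1.R and A, the sketch yields three constants: a itself, one witness constant and one generic constant.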

Note that the algorithm does nothing in the case $\leq n.R(a) \in O$. An assertion of this form does not enforce the existence of more elements in the interpretation of $R$. However, it may imply facts of the form $\forall R.C(a)$. This will be reflected in Definition 3 where $(a, \ast_{R,a})$ may be included in the extension of the complement $\neg R$ of $R$.

Using standard arguments [6, 14], we can show that the algorithm computing $O \vdash s$ terminates.

Theorem 2. Assume $O$ is finite. There exists a natural number $\alpha$ depending on $O$ such that

1. $O \vdash C(a) \Longrightarrow O \vdash_\alpha C(a)$,
2. $O \vdash R(a, b) \Longrightarrow O \vdash_\alpha R(a, b)$.

Definition 3. We define the relational database instance $\hat{O}$ by:

- $A(a)$ is included in $\hat{O}$ if $O \vdash A(a)$ for a base symbol $A$,
- $\neg A(a)$ is included in $\hat{O}$ if $O \vdash \neg A(a)$ for a base symbol $A$,
- $R(a, b)$ is included in $\hat{O}$ if $O \vdash R(a, b)$ and $b \neq \ast_{R,a}$,
- $\neg R(a, b)$ is included in $\hat{O}$ if $O \vdash \leq n.R(a)$ and $b$ is one of $s_1, \ldots, s_m$ which are defined as follows. Let $S := \{x \in \hat{\mathcal{C}} \mid O \not\vdash R(a, x)\}$. Let $s_1, \ldots, s_l$ be an enumeration of $S$. Let $k$ be the least natural number such that $O \vdash \leq k.R(a)$. The index $m$ is defined as $|\hat{\mathcal{C}}| - k$. If the cardinality $l$ of $S$ is only $|\hat{\mathcal{C}}| - k - 1$, then we let $s_m$ be the constant $\ast_{R,a}$.

The relations $\neg A$ and $\neg R$ in $\hat{O}$ are needed in order to deal with negated concepts and negative information about roles.

Example 4. Definition 3 shows why we introduced terms $\ast_{R,a,i}$ in $\hat{\mathcal{C}}$ not only for the at-least statements but also for the at-most statements in $O$. For an assertion like $\leq n.R(a)$ nothing happens in the definition of $\vdash_\alpha$. However, we have to guarantee that there are enough elements that may be included in $\neg R$. Assume $O$ consists of the statements $\leq 1.R(a)$, $R(a, a)$ and $A(a)$. Then $\hat{\mathcal{C}}$ will be the set $\{a, \ast_{R,a}, \ast_{R,a,1}\}$. Therefore $|\hat{\mathcal{C}}| = 3$. In Definition 3 we will have $S = \{\ast_{R,a,1}\}$ since $O \vdash R(a, \ast_{R,a})$. Hence $l = 1$, whereas the index $m$ is defined as $|\hat{\mathcal{C}}| - k$, which is 2. We find that $l = |\hat{\mathcal{C}}| - k - 1$ holds, therefore we let $s_2 := \ast_{R,a}$. Finally we have $\neg R(a, \ast_{R,a,1}) \in \hat{O}$ and $\neg R(a, \ast_{R,a}) \in \hat{O}$.

Definition 5. A knowledge base $O$ contains a clash if one of the following three conditions holds.

1. There exist an atomic concept $A$ and an individual $a \in \hat{\mathcal{C}}$ such that both $A(a)$ and $\neg A(a)$ are elements of $\hat{O}$.
2. There exists an individual $a \in \hat{\mathcal{C}}$ such that $\bot(a) \in \hat{O}$.
3. There exist a role name $R$, a natural number $n$, an individual $a \in \hat{\mathcal{C}}$ and $n + 1$ distinct individuals $b_i \in \hat{\mathcal{C}}$ such that $O \vdash \leq n.R(a)$ and $R(a, b_i) \in \hat{O}$ for $1 \leq i \leq n + 1$.

A knowledge base is called clash free if it does not contain a clash. We obtain the following standard results concerning clashes [6, 14].

Lemma 6. If a knowledge base $O$ has a model, then $O$ is clash free.

Definition 7. The canonical interpretation $I(O)$ induced by a knowledge base $O$ is defined as $(\hat{\mathcal{C}}, \cdot^I)$ where

$a^I := a$
$A^I := \{x \in \hat{\mathcal{C}} \mid A(x) \in \hat{O}\}$
$R^I := \{(x, y) \in \hat{\mathcal{C}} \times \hat{\mathcal{C}} \mid R(x, y) \in \hat{O}\}$

Lemma 8. Assume $O$ is clash free. Then $I(O)$ is a model of $O$.

4 Soundness and completeness

We introduce a special consequence relation $\hat{O} \models_{DB} C(a)$ for our database instance $\hat{O}$ and ALN concept descriptions $C$. This gives us a translation of description logic retrieval queries to queries on our relational database instance.

Definition 9. The consequence relation $\models_{DB}$ is defined by:

1. $\hat{O} \models_{DB} A(a)$ if $A(a) \in \hat{O}$
2. $\hat{O} \models_{DB} \neg A(a)$ if $\neg A(a) \in \hat{O}$
3. $\hat{O} \models_{DB} C \sqcap D(a)$ if $\hat{O} \models_{DB} C(a)$ and $\hat{O} \models_{DB} D(a)$
4. $\hat{O} \models_{DB} \forall R.C(a)$ if $\hat{O} \models_{DB} C(\ast_{R,a})$ or $\big[\neg R(a, \ast_{R,a}) \in \hat{O} \wedge \forall x.(R(a, x) \in \hat{O} \rightarrow \hat{O} \models_{DB} C(x))\big]$
5. $\hat{O} \models_{DB} \geq n.R(a)$ if $|\{x \mid R(a, x) \in \hat{O}\}| \geq n$

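6. $\hat{O} \models_{DB} \leq n.R(a)$ if $\exists x \in \hat{\mathcal{C}}.\,\neg R(a, x) \in \hat{O}$ and $|\{x \mid \neg R(a, x) \in \hat{O}\}| \geq |\hat{\mathcal{C}}| - n$

First we are going to show completeness of this relational representation. We need the following lemmas.

Lemma 10. Let $A$ be an atomic concept description. Assume $O$ is clash free and $\hat{O} \not\models_{DB} \neg A(a)$. Then $O \cup \{A(a)\}$ is clash free.

Proof. Looking at the definition of $\vdash_\alpha$, we observe that the addition of $A(a)$ to $O$ can only invoke the rule for $A \sqsubseteq C$ statements. Therefore we get that if $O \not\vdash C(b)$ and $O \cup \{A(a)\} \vdash C(b)$, then $b$ must be the individual $a$ and $C$ must be an atomic concept or the negation of an atomic concept. Suppose now that

$O \cup \{A(a)\}$ is not clash free. (1)

We assume for example $O \cup \{A(a)\} \vdash B(a)$ and $O \cup \{A(a)\} \vdash \neg B(a)$. Since $O$ is clash free, we must have $O \not\vdash B(a)$ or $O \not\vdash \neg B(a)$. We investigate only the first case. The second can be treated analogously. If $O \not\vdash B(a)$, then there exists a chain $A \equiv L_0, L_1, L_2, \ldots, L_n \equiv B$ such that all statements $L_i \sqsubseteq L_{i+1}$ are included in $O$. We distinguish two cases.

- $O \vdash \neg B(a)$. In this case, there must exist a chain $\neg B \equiv \neg L_n \sqsubseteq \ldots \sqsubseteq \neg L_1 \sqsubseteq \neg L_0 \equiv \neg A$ in $O$. Then $O \vdash \neg B(a)$ implies $O \vdash \neg A(a)$. This yields $\hat{O} \models_{DB} \neg A(a)$ which gives a contradiction.
- $O \not\vdash \neg B(a)$. Since we assumed $O \cup \{A(a)\} \vdash \neg B(a)$, there must exist also a chain $A \equiv M_0 \sqsubseteq M_1 \sqsubseteq M_2 \sqsubseteq \ldots \sqsubseteq M_m \equiv \neg B$ in $O$. Then there also is a chain $B \equiv \neg M_m \sqsubseteq \ldots \sqsubseteq \neg M_1 \sqsubseteq \neg M_0 \equiv \neg A$ in $O$. Hence, we find a chain $A \sqsubseteq L_1 \sqsubseteq \ldots \sqsubseteq L_{n-1} \sqsubseteq B \sqsubseteq \neg M_{m-1} \sqsubseteq \ldots \sqsubseteq \neg M_1 \sqsubseteq \neg A$ in $O$. This implies $A \sqsubseteq \bot \in O$ which gives us $\neg A(a) \in \hat{O}$. We finally conclude $\hat{O} \models_{DB} \neg A(a)$ which contradicts our assumptions.

We conclude that (1) cannot hold. Therefore $O \cup \{A(a)\}$ must be clash free.

Definition 11. Let $Z$ be a set of constants of the form $\ast_{R,a}$. We define $I(O, Z)$ to be the interpretation arising from $I(O)$ by interpreting role names as follows:

$R^{I(O,Z)} := R^{I(O)} \cup \{(x, \ast_{R,x}) \mid \ast_{R,x} \in Z\}$.

Lemma 12. Let $O$ be clash free and assume that $a$ and $R$ are such that $\neg R(a, \ast_{R,a}) \notin \hat{O}$. Let $X$ be a set of statements of the form $B(b)$ and $Z$ as in the previous definition. Assume $I(O \cup X, Z) \models O$. We find that $I(O \cup X, Z \cup \{\ast_{R,a}\}) \models O$.

Proof. Let $I$ be $I(O \cup X, Z \cup \{\ast_{R,a}\})$. There are two critical cases:

- $\forall R.C(a) \in O$. Assume $(a^I, x^I) \in R^I$. We have to show $I \models C(x)$. There are two cases:
  1. If $(a^I, x^I) \in R^{I(O \cup X, Z)}$, then we find by $I(O \cup X, Z) \models O$ that $I(O \cup X, Z) \models C(x)$ which gives $I \models C(x)$.
  2. If $(a^I, x^I) \notin R^{I(O \cup X, Z)}$, then we get $x = \ast_{R,a}$. By the definition of $\vdash_\alpha$, we have $O \vdash R(a, \ast_{R,a})$. Therefore we find $O \vdash C(\ast_{R,a})$ which yields $I \models C(\ast_{R,a})$.
- $\leq n.R(a) \in O$. Suppose that already $|\{x \in \hat{\mathcal{C}} \mid (a, x) \in R^{I(O \cup X, Z)}\}| = n$ and $\ast_{R,a} \notin Z$. The sets $X$ and $Z$ cannot contribute to $\{x \in \hat{\mathcal{C}} \mid (a, x) \in R^{I(O \cup X, Z)}\}$. So we find $|\{x \in \hat{\mathcal{C}} \mid (a, x) \in R^{I(O)}\}| = n$. Additionally, we have $O \vdash R(a, \ast_{R,a})$. This gives us $|\{x \in \hat{\mathcal{C}} \mid O \vdash R(a, x)\}| = n + 1$. Hence we get $|\{x \in \hat{\mathcal{C}} \mid O \not\vdash R(a, x)\}| = |\hat{\mathcal{C}}| - n - 1$. Thus, by Definition 3 we obtain $\neg R(a, \ast_{R,a}) \in \hat{O}$ which contradicts our assumptions. Therefore we have $|\{x \in \hat{\mathcal{C}} \mid (a, x) \in R^{I(O \cup X, Z)}\}| < n$ which finally yields $|\{x \in \hat{\mathcal{C}} \mid (a, x) \in R^I\}| \leq n$.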
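Returning to Definition 9, the six clauses of the database consequence relation can be read as a recursive query evaluator. The following Python sketch is a hedged illustration, not the paper's implementation: the database instance is assumed to be a set of tuples tagged "atom", "not", "role" and "notrole", and the constants indexed by R, a and by R, a, i are assumed to be encoded as ("generic", R, a) and ("witness", R, a, i). All of these names are hypothetical.

```python
def entails_db(db, domain, concept, a):
    """Decide whether the instance db entails concept(a) in the sense of Definition 9."""
    tag = concept[0]
    if tag == "atom":                      # clause 1
        return ("atom", concept[1], a) in db
    if tag == "not":                       # clause 2
        return ("not", concept[1], a) in db
    if tag == "and":                       # clause 3
        return (entails_db(db, domain, concept[1], a)
                and entails_db(db, domain, concept[2], a))
    if tag == "all":                       # clause 4
        role, body = concept[1], concept[2]
        generic = ("generic", role, a)
        if entails_db(db, domain, body, generic):
            return True
        successors = [x for x in domain if ("role", role, a, x) in db]
        return (("notrole", role, a, generic) in db
                and all(entails_db(db, domain, body, x) for x in successors))
    if tag == "atleast":                   # clause 5
        n, role = concept[1], concept[2]
        return sum(1 for x in domain if ("role", role, a, x) in db) >= n
    if tag == "atmost":                    # clause 6
        n, role = concept[1], concept[2]
        excluded = sum(1 for x in domain if ("notrole", role, a, x) in db)
        return excluded >= 1 and excluded >= len(domain) - n
    raise ValueError(f"unsupported constructor: {tag}")

# Instance of Example 4: O = { (<= 1.R)(a), R(a, a), A(a) }.
GEN = ("generic", "R", "a")
WIT = ("witness", "R", "a", 1)
DOMAIN = ["a", GEN, WIT]
DB = {
    ("atom", "A", "a"),
    ("role", "R", "a", "a"),
    ("notrole", "R", "a", WIT),
    ("notrole", "R", "a", GEN),
}

if __name__ == "__main__":
    for query in [("atmost", 1, "R"), ("atmost", 0, "R"),
                  ("atleast", 1, "R"), ("all", "R", ("atom", "A"))]:
        print(query, entails_db(DB, DOMAIN, query, "a"))
```

On the instance of Example 4, the sketch accepts ≤1.R(a), ≥1.R(a) and ∀R.A(a) and rejects ≤0.R(a). The value restriction is accepted through the second alternative of clause 4: the generic constant is excluded from R, and the only stored successor a satisfies A.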

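Lemma 13. Let $O$ be clash free and assume we have an individual $a$ and a role name $R$ with $\hat{O} \not\models_{DB} \leq n.R(a)$. Then there exists a set $X$ such that $I(O \cup X) \models O$ and $I(O \cup X) \not\models \leq n.R(a)$.

Proof. We let $l := |\{x \mid R(a, x) \in \hat{O}\}|$. If $l > n$ holds, then we know that $I(O) \not\models \leq n.R(a)$ and we are done. So assume $l \leq n$. Let $X := \{R(a, x) \mid x = \ast_{R,a,i} \text{ with } l + 1 \leq i \leq n + 1\}$. We find $I(O \cup X) \not\models \leq n.R(a)$ since $|\{x \mid R^{I(O \cup X)}(a, x)\}| = n + 1$. Now we are going to show that $O \cup X$ is clash free.

1. Suppose there exist an atomic concept $A$ and an individual $b \in \hat{\mathcal{C}}$ such that both $A(b)$ and $\neg A(b)$ are in $\widehat{O \cup X}$. Since $O$ is clash free, one of $A(b)$ and $\neg A(b)$ does not belong to $\hat{O}$. Assume $A(b) \notin \hat{O}$ and $\neg A(b) \in \hat{O}$. The first assumption implies

   $O \vdash \forall R.A(a)$. (2)

   Since $b$ is of the form $\ast_{R,a,i}$, the second assumption is only possible if we have $O \vdash \geq j.R(a)$ for a $j \geq i$. However, then we also have $O \vdash R(a, \ast_{R,a,i})$ which gives us $O \vdash A(\ast_{R,a,i})$ by (2). That is $A(b) \in \hat{O}$. Contradiction. The case $\neg A(b) \notin \hat{O}$ and $A(b) \in \hat{O}$ is treated analogously. So assume both $A(b)$ and $\neg A(b)$ do not belong to $\hat{O}$. Then there exists a concept $B$ such that there is a chain $B \sqsubseteq \ldots \sqsubseteq A$ in $O$ and $O \vdash \forall R.B(a)$. This implies $O \vdash \forall R.A(a)$. In the same way we also obtain $O \vdash \forall R.\neg A(a)$. Therefore we get $O \vdash \leq 0.R(a)$. Then we find $|\{x \mid \neg R(a, x) \in \hat{O}\}| = |\hat{\mathcal{C}}|$. That is $\hat{O} \models_{DB} \leq n.R(a)$ which contradicts our assumptions.
2. Suppose there exists $b \in \hat{\mathcal{C}}$ with $\bot(b) \in \widehat{O \cup X}$. Since $O$ is clash free, we must have $O \vdash \forall R.\bot(a)$. This implies $O \vdash \leq 0.R(a)$. However, then we find $|\{x \mid \neg R(a, x) \in \hat{O}\}| = |\hat{\mathcal{C}}|$. That is $\hat{O} \models_{DB} \leq n.R(a)$ which contradicts our assumptions.
3. Suppose there exist $k + 1$ many individuals $b_i$ such that $R(a, b_i) \in \widehat{O \cup X}$ and $\leq k.R(a) \in \widehat{O \cup X}$. We find that already $\leq k.R(a) \in \hat{O}$ and that $k > n$ must hold because of $\hat{O} \not\models_{DB} \leq n.R(a)$. However, by the construction of $X$ we know that $|\{x \mid R(a, x) \in \widehat{O \cup X}\}| = n + 1$ which results in a contradiction.

We finally conclude that $I(O \cup X) \models O$.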

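Lemma 14. Assume $O$ is clash free. Let $\hat{O} \not\models_{DB} C(a)$. Then there exists a model $I$ of $O$ such that $I \not\models C(a)$ and $I = I(O \cup X, Z)$ where $X$ is a set of statements $B(b)$ and $Z$ is a set of constants of the form $\ast_{R,a}$.

Proof. Assume $\hat{O} \not\models_{DB} C(a)$. We proceed by induction on the structure of the concept $C$:

- $C \equiv A$: We have $A(a) \notin \hat{O}$. Therefore also $I \not\models A(a)$.
- $C \equiv \neg A$: By Lemma 10 we get that $O \cup \{A(a)\}$ is clash free. Therefore, from Lemma 8 we infer that $I(O \cup \{A(a)\})$ is a model for $O \cup \{A(a)\}$ and hence a model for $O$. Moreover, $I(O \cup \{A(a)\}) \not\models \neg A(a)$.
- $C \equiv D \sqcap E$: We have either $\hat{O} \not\models_{DB} D(a)$ or $\hat{O} \not\models_{DB} E(a)$. Assume $\hat{O} \not\models_{DB} D(a)$. Then by the induction hypothesis there exists a model $I$ of $O$ with $I \not\models D(a)$. This also implies $I \not\models D \sqcap E(a)$.
- $C \equiv \forall R.D$: We have $\hat{O} \not\models_{DB} D(\ast_{R,a})$ and

  $\neg R(a, \ast_{R,a}) \notin \hat{O} \;\vee\; \exists x \in \hat{\mathcal{C}}.(R(a, x) \in \hat{O} \wedge \hat{O} \not\models_{DB} D(x))$. (3)

  We distinguish the two cases for (3).
  1. Assume

     $\hat{O} \not\models_{DB} D(\ast_{R,a})$, (4)
     $\neg R(a, \ast_{R,a}) \notin \hat{O}$. (5)

     By the induction hypothesis for (4) we obtain that there exist sets $X$ and $Z$ such that $I(O \cup X, Z) \models O$ and $I(O \cup X, Z) \not\models D(\ast_{R,a})$. Let $I$ be $I(O \cup X, Z \cup \{\ast_{R,a}\})$. We also get $I \not\models D(\ast_{R,a})$. Since $I \models R(a, \ast_{R,a})$, we conclude $I \not\models \forall R.D(a)$. Moreover, by (5) and Lemma 12 we find that $I \models O$.
  2. Assume there exists an $x \in \hat{\mathcal{C}}$ such that $R(a, x) \in \hat{O} \wedge \hat{O} \not\models_{DB} D(x)$. By the induction hypothesis we obtain that there exists a model $I$ of $O$ such that $I \not\models D(x)$. However, $R(a, x) \in \hat{O}$ implies $I \models R(a, x)$. Therefore we conclude $I \not\models \forall R.D(a)$.
- $C \equiv \geq n.R$: We find $|\{x \mid R(a, x) \in \hat{O}\}| < n$. Hence we also have $|\{x \mid (a, x) \in R^I\}| < n$. This implies $I \not\models \geq n.R(a)$.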

- $C \equiv \leq n.R$: By Lemma 13 we find a set $X$ such that $I(O \cup X) \models O$ and $I(O \cup X) \not\models \leq n.R(a)$. Thus we are done.

As an immediate consequence of this lemma, we get the desired completeness result.

Theorem 15. Let $O$ be a satisfiable knowledge base. We have

$O \models C(a) \Longrightarrow \hat{O} \models_{DB} C(a)$.

Now we are going to show soundness of our database instance. Every statement we can infer from the database instance is a consequence of the knowledge base. In order to prove soundness, we need a series of lemmas and definitions.

Lemma 16. Assume $a, b$ are individuals of $\mathcal{C}$.

1. $\forall\alpha\,(O \vdash_\alpha C(a) \Longrightarrow O \models C(a))$
2. $\forall\alpha\,(O \vdash_\alpha R(a, b) \Longrightarrow O \models R(a, b))$

Proof. For the second statement we first observe that $O \vdash_\alpha R(a, b)$ implies $R(a, b) \in O$ because of $a, b \in \mathcal{C}$. Therefore we conclude $O \models R(a, b)$. We prove the first statement by induction on $\alpha$. Assume there exists an $\alpha$ such that $O \vdash_\alpha C(a)$. If $O \vdash_0 C(a)$, then we have one of the following three cases.

- $C(a)$ is an element of $O$. Hence, we conclude $O \models C(a)$.
- $C \equiv \top$. For every $a \in \mathcal{C}$ we know $O \models \top(a)$.
- $C \equiv \neg A$ and $A \sqsubseteq \bot \in O$. The inclusion statement implies that $O \models A \equiv \bot$. Therefore we obtain $O \models \neg A(a)$ for every $a \in \mathcal{C}$.

If $O \not\vdash_0 C(a)$, then there exists $\beta$ with $O \not\vdash_\beta C(a)$ and $O \vdash_{\beta+1} C(a)$. We distinguish the following different cases.

- $O \vdash_\beta C \sqcap D(a)$. By the induction hypothesis we have $O \models C \sqcap D(a)$. Therefore we obtain $O \models C(a)$.
- $O \vdash_\beta D \sqcap C(a)$. Analogous to the previous case.
- $O \vdash_\beta \forall R.C(b)$ and $O \vdash_\beta R(b, a)$ for some $b$. Since $a$ is in $\mathcal{C}$ and $O \vdash_\beta R(b, a)$, we obtain $b \in \mathcal{C}$. By applying the induction hypothesis we get $O \models \forall R.C(b)$ and $O \models R(b, a)$. Together this yields $O \models C(a)$.

D(a) and D C O. The induction hypothsis yields O = D(a). Together with O = D C we conclude O = C(a). R.D(a) and D C O. The induction hypothesis yields O = R.D(a). With O = D C we conclude O = R.C(a). R.A(a) and O β R. A(a). The induction hypothesis yields O = R.A(a) and O = R. A(a). These two facts together imply O = 0.R(a). R. n.r 2 (a) and O β R. m.r 2 (a) for n < m. The induction hypothesis yields O = R. n.r 2 (a) and O = R. m.r 2 (a). This implies O = 0.R(a). R. (a). The induction hypothesis yields O = R. (a). This also implies O = 0.R(a). We need some auxiliary definitions to cope with terms of the form R,a. Definition 17. The unfolding of concept assertions is given by: C(s) if s is not of the form R,t C(s) := for some R and t, ( R.C(t)) if s R,t for some R and t. Definition 18. The function basis determines the individual constant occurring in the unfolding of a concept assertion. It is defined by: basis(a) := a if a C, basis(a) := a if a is of the form R,b,i for some R and b, basis(a) := basis(b) if a is of the form R,b for some R and b. Lemma 19. Assume basis( R,a ) C. Then we have O C( R,a ) O = C( R,a ). Proof. We show by induction on α that for all α the following holds O α C( R,a ) O = C( R,a ). If O 0 C( R,a ), then we have one of the following two cases. 16

C. We know that O = R 1... R n. (a) for all role names R 1,..., R n and individuals a C. Hence the claim holds. C A and A O. This implies O = A and we can argue as in the previous case. If there exists β with O β C( R,a) and O β+1 C( R,a ), then we distiguish the following cases. C D( R,a ). By the induction hypothesis we have and hence also O = C( R,a ). O = C D( R,a ) D C( R,a ). Analogous to the previous case. R 2.C(b) and O β R 2 (b, R,a ). We find that O β R 2 (b, R,a ) implies b a and R 2 R. Therefore we have O β R.C(a). We distiguish two cases. 1. If a C, then Lemma 16 yields O = R.C(a). This is the same as O = C( R,a ). 2. If a C, then a is of the form R3,c and basis(a) C. Hence the induction hypothesis gives O = R.C(a). This is the same as O = C( R,a ). D( R,a ) and D C O. With the induction hypothesis we get O = D( R,a ). By O = D C we conclude O = C( R,a ). R 2.D( R,a ) and D C O. We infer O = R 2.D( R,a ) by the induction hypothesis. Together with O = D C, this gives us O = R 2.C( R,a ). R 2.A( R,a ) and O β R 2. A( R,a ). Applying the induction hypothesis to both statements results in O = R 2.A( R,a ) as well as O = R 2. A( R,a ). The conjunction of these two statements is equivalent to O = 0.R 2 ( R,a ). R 2. n.r 3 ( R,a ) and O β R 2. m.r 3 ( R,a ) for n < m. Applying the induction hypothesis yields O = R 2. n.r 3 ( R,a ) as well as O = R 2. m.r 3 ( R,a ). The conjunction of these two statements is equivalent to O = 0.R 2 ( R,a ). 17

R 2. ( R,a ). The induction hypothesis yields O = R 2. ( R,a ). This is equivalent to O = 0.R 2 ( R,a ). Lemma 20. Let R,a,i be an element of Ĉ. We have O C( R,a,i ) O C( R,a ). Proof. We show by induction on α that α.(o α C( R,a,i ) O C( R,a )). If O 0 C( R,a,i ), then we have either C or C A with A O. In both cases our claim trivially holds. Now assume that there exists β with O β C( R,a,i) and O β+1 C( R,a,i ). We distiguish the following different cases. C D( R,a,i ). By the induction hypothesis we have and hence also O C( R,a ). O C D( R,a ) D C( R,a,i ). Analogous to the previous case. R 2.C(b) and O β R 2 (b, R,a,i ). We find that O β R 2 (b, R,a,i ) implies b a and R 2 R. Therefore we have both O β R.C(a) and O β R(a, R,a ). Hence we conclude O C( R,a ). A( R,a,i ) and A C O. By the induction hypothesis we get O A( R,a ). By the closure properties of we conclude O C( R,a ). R 2.D( R,a,i ) and D C O. We obtain O R 2.D( R,a ) by the induction hypothesis. Because of D C O and the closure properties of, we can conclude O R 2.C( R,a ). R 2.A( R,a,i ) and O β R 2. A( R,a,i ). Applying the induction hypothesis yields O R 2.A( R,a ) and O R 2. A( R,a ). Then the definition of O implies O 0.R 2 ( R,a ). R 2. n.r 3 ( R,a,i ) and O β R 2. m.r 3 ( R,a,i ) for n < m. Applying the induction hypothesis yields O R 2. n.r 3 ( R,a ) as well as O R 2. m.r 3 ( R,a ). Then the definition of O implies O 0.R 2 ( R,a ). 18

R 2. ( R,a,i ). We employ the induction hypothesis to obtain O R 2. ( R,a ) which implies O 0.R 2 ( R,a ). Now we can state the soundness theorem for the database instance that we have constructed. We want to show Ô = DB C(a) O = C(a) for all a C. In order to prove this statement, we have to formulate it more generally. This is done in the next theorem. Theorem 21. Assume basis(a) C. We have Ô = DB C(a) O = C(a). Proof. We show the statement by induction on the structure of C. If C is a base symbol, we get C(a) Ô. That is O C(a). Using Lemma 16 and Lemma 19, we obtain O = C(a). If C is of the form D, then we find O C(a). Again by Lemma 16 and Lemma 19, we obtain O = C(a). Assume C is of the form D E. By the induction hypothesis we get O = D(a) as well as O = E(a). Therefore O = D E(a) holds. Assume C is of the form R.D. We have the following two possibilities given by the definition of the = DB relation. Ô = DB D( R,a ). We apply the induction hypothesis to infer O = D( R,a ). By the unfolding we get O = ( R.D(a)) which is the same as O = C(a). Ô = DB D( R,a ). Then we have R(a, R,a ) Ô and (6) x.(r(a, x) Ô Ô = DB D(x)). (7) By Definition 3 we infer from (6) that k.r(a) O where k is the least number n such that n.r(a) O. Additionally, there exist k many individuals b j Ĉ such that R(a, b j) Ô. Since Ô = DB D( R,a ), we know by (7) that b j R,a for all those b j. Moreover, Lemma 20 implies by (7) that b j R,a,i. Therefore b j C and hence a C by the definition of O α R(a, b). Summing up, we have O = k.r(a) as well as O = R(a, b j ) and O = D(b j ) for the k many individuals b j C. Taking these facts together we finally obtain O = R.D(a). That is O = C(a) since a C. 19

Assume $C$ is of the form ${\ge}n.R$. Assume there are $n$ many individuals $b_1, \ldots, b_n \in \mathcal{C}$ such that $R(a, b_i) \in \hat{O}$. Note that this implies $a \in \mathcal{C}$. Then we have $R(a, b_i) \in O$ for $1 \le i \le n$. Hence, we also have $O \models R(a, b_i)$ for $1 \le i \le n$. Therefore, we conclude $O \models {\ge}n.R(a)$, which is $O \models C(a)$. If there are only $m < n$ many individual constants $b_i \in \mathcal{C}$ with $R(a, b_i) \in \hat{O}$, then $\hat{O} \models_{DB} {\ge}n.R(a)$ implies that there are at least $n - m$ many constants $\langle R,a,i\rangle$ such that $R(a, \langle R,a,i\rangle) \in \hat{O}$. This is only possible if $O \vdash {\ge}n.R(a)$ holds. We conclude $O \models {\ge}n.R(a)$ by Lemma 16 and Lemma 19.

Assume $C$ is of the form ${\le}n.R$. We have $|\{x \in \hat{\mathcal{C}} \mid R(a,x) \in \hat{O}\}| \le n$ and there exists an individual $x$ such that $R(a,x) \notin \hat{O}$. Then by Definition 3 there exists a $k \le n$ such that $O \vdash {\le}k.R(a)$. This implies $O \models {\le}k.R(a)$. Because of $k \le n$, we can finally conclude $O \models {\le}n.R(a)$, which is $O \models C(a)$.

Corollary 22. Let $O$ be a satisfiable knowledge base. Let $C$ be a concept description of $O$ and $a$ be an individual of $\mathcal{C}$. Let $\hat{O}$ be the database instance as above. Then we have $O \models C(a) \iff \hat{O} \models_{DB} C(a)$.

5 Applications

Ontologies are an important tool for dealing with structured information. They provide a formal and machine-processable representation of such information. This enables automatic support for acquiring, maintaining and accessing it. We present an application of the relational representation of ontological knowledge bases in the context of online analytical processing (OLAP).

OLAP is a term that was introduced in the white paper [11] by Codd et al. It is an approach to quickly (online) providing answers to complex (analytic) database queries, for instance in the area of business analysis and reporting. One of the main features of an OLAP system is multidimensionality. According to [23], this is certainly the most logical way to analyze businesses and organizations. The multidimensional conceptual view of the
data also includes full support for hierarchies and multiple hierarchies. For example, the time dimension may be organized in a hierarchy having the following aggregation paths:

DateTime, Hour of Day
DateTime, Date, Day of Week
DateTime, Date, Month, Year

This makes it possible for an analyst to examine the data at the appropriate level of detail and to perform drill-down or roll-up operations if needed.

However, dimensions may also be structured in much more complex ways. Consider a location dimension. This dimension usually has a geographic aggregation path with the levels State, Country and Continent. In addition, it may be structured according to the international organizations a Country belongs to, such as the G8, the UN Security Council or the European Union. Then there is also an economic structure induced, for example, by the notions of Least Developed Countries, Developing Countries and Developed Countries. Further, the location dimension can be organized by the traffic routes to which a location is connected. These may be instances of roads, railways, rivers and seas. Of course, there are many more concepts according to which the location dimension may be ordered.

Ontologies formally model concept definitions and relations among them. Hence, we can use ontologies to describe complex dimension hierarchies. For example, to talk about locations, we can use the terms and relations of the CIA World Fact Book, which is available as an OWL ontology [13]. This provides an explicit model of a location hierarchy. If we build the database representation of such a location knowledge base, then we get in the database a constant for each country of the knowledge base and a unary relation for each base symbol of the knowledge base.

The standard approach to dimension modeling [20] uses a so-called star schema. It is composed of a fact table and several dimension tables. The fact table consists of one or more numerical facts and a multipart key.
Each dimension table has a single-part primary key that corresponds exactly to one of the components of the key of the fact table. For example, if we have a location dimension, then the corresponding key will determine a country.

Our setting for ontology-based online analytical processing is as follows. We use the same fact table as in standard dimensional modeling. However, instead of the dimension tables, we employ the relational representation of the knowledge base.
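The setting just described can be illustrated with a minimal sketch, assuming SQLite as the RDBMS. The table and column names (`sales`, `country`, `amount`, `individual`) and all rows are made up for illustration and are not taken from the paper. The unary relations `EuropeanUnion` and `G8` stand for the stored extensions of two base symbols, and each takes the place of a dimension table in the join.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a fact table and concept relations."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical fact table: a one-part key (country) and one numerical fact.
    cur.execute("CREATE TABLE sales (country TEXT, amount INTEGER)")
    # One unary relation per base symbol, holding its computed extension.
    cur.execute("CREATE TABLE EuropeanUnion (individual TEXT PRIMARY KEY)")
    cur.execute("CREATE TABLE G8 (individual TEXT PRIMARY KEY)")
    # Illustrative rows only.
    cur.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("France", 10), ("Japan", 20), ("Austria", 5), ("France", 7)],
    )
    cur.executemany(
        "INSERT INTO EuropeanUnion VALUES (?)", [("France",), ("Austria",)]
    )
    cur.executemany("INSERT INTO G8 VALUES (?)", [("France",), ("Japan",)])
    conn.commit()
    return conn


def total_by_concept(conn, concept):
    """Sum the fact over all countries in the extension of a base concept."""
    # Table names cannot be bound as SQL parameters, so we only accept
    # the concept relations that this sketch actually creates.
    if concept not in {"EuropeanUnion", "G8"}:
        raise ValueError(f"unknown concept relation: {concept}")
    row = conn.execute(
        f"SELECT COALESCE(SUM(s.amount), 0) FROM sales AS s "
        f"JOIN {concept} AS c ON s.country = c.individual"
    ).fetchone()
    return row[0]


conn = build_demo_db()
print(total_by_concept(conn, "EuropeanUnion"))  # 22
print(total_by_concept(conn, "G8"))             # 37
```

Grouping the same facts by a different concept only requires joining with a different unary relation; the fact table itself is unchanged.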

Consider the data warehouse of an international transportation company. It will contain a fact table storing information about which goods have been transported from which country to which country. Hence, the primary key of the fact table will contain at least three parts, namely good, fromcountry and tocountry. If we are interested only in goods being shipped to a European Union country, then we can join the fact table and the relation that represents the EuropeanUnion concept via the tocountry attribute. If we would like to further restrict our query to transports inside the EU, then we can build another join with the EuropeanUnion concept, this time via the fromcountry attribute of the fact table.

Of course, we can use not only base concepts in our OLAP queries but also arbitrary concepts C definable in our knowledge base language. To do so, we have to replace the relation for the base concept in the query by a corresponding (sub-)query which retrieves the individuals belonging to the concept C.

There is an obvious way to optimize such queries. In the original knowledge base we did not only have base symbols as atomic concepts but also name symbols for which a definition was given in the terminological part of the knowledge base. The fact that a name symbol has been introduced for a certain concept C shows that C is an important concept which will be used often. Hence, we can introduce a materialized view for this concept so that we do not need to compute its extension from scratch every time it occurs in a query.

An efficient implementation of this approach needs further optimizations. Rewriting queries using views has become one of the important tools for query optimization and for applications in information integration and data warehousing. The recent paper [16] by Goasdoué and Rousset provides a uniform logical setting to examine the problems of answering and rewriting queries using views.
In the context of description logics, these problems have previously been studied for example in [4, 7, 9, 24].

6 Conclusions

We have considered the retrieval problem for knowledge bases which are formulated in the description logic language ALN. We have shown that it is possible to represent such a knowledge base O as a relational database instance Ô. A description logic retrieval query can be translated to a corresponding relational database query over Ô which provides exactly the answers to the original query over O. That is, our procedure is sound and complete.

We have presented an application of our implementation of the retrieval problem in the context of online analytical processing. If we formalize (a
part of) the multidimensional hierarchical model as an ALN knowledge base, then it becomes possible to use the definable concepts of this knowledge base to group the facts of the OLAP database.

Interesting directions for continuing our work are to study database representations of more expressive description logic languages as well as to investigate optimization techniques for the retrieval problem in the case of relational representations of description logic knowledge bases.

References

[1] Rakesh Agrawal, Amit Somani, and Yirong Xu. Storage and querying of e-commerce data. In Proc. of the 27th Int. Conf. on Very Large Data Bases, pages 149-158. Morgan Kaufmann, 2001.
[2] Grigoris Antoniou and Frank van Harmelen. A Semantic Web Primer. MIT Press, 2004.
[3] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter Patel-Schneider. The Description Logic Handbook. Cambridge, 2003.
[4] Franz Baader, Ralf Küsters, and Ralf Molitor. Rewriting concepts using terminologies. In A. Cohn, F. Giunchiglia, and B. Selman, editors, Proc. of the 7th Int. Conf. on Knowledge Representation and Reasoning KR 2000, pages 297-308. Morgan Kaufmann, 2000.
[5] Franz Baader and Werner Nutt. Basic description logics. In F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors, The Description Logic Handbook, pages 43-95. Cambridge, 2003.
[6] Franz Baader and Ulrike Sattler. Expressive number restrictions in description logics. Journal of Logic and Computation, 9:319-350, 1999.
[7] Catriel Beeri, Alon Y. Levy, and Marie-Christine Rousset. Rewriting queries using views in description logics. In Proc. of the 16th Symp. on Principles of Database Systems PODS 97. ACM, 1997.
[8] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF schema. In The Semantic Web - ISWC 2002, volume 2342 of LNCS, pages 54-68. Springer, 2002.
[9] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Answering queries using views over description logics knowledge bases. In Proc. of the 16th Nat. Conf. on Artificial Intelligence AAAI 2000, pages 386-391, 2000.
[10] Iulian Ciorascu, Claudia Ciorascu, and Kilian Stoffel. Scalable ontology implementation based on knowler. In Workshop on Practical and Scaleable Semantic Web Systems, ISWC 2003, 2003.
[11] E.F. Codd, S.B. Codd, and C.T. Salley. Providing OLAP to user-analysts: An IT mandate, 1993. Available at http://www.hyperion.com.
[12] George P. Copeland and Setrag Khoshafian. A decomposition storage model. In S. B. Navathe, editor, Proc. of the 1985 ACM SIGMOD Int. Conf. on Management of Data, pages 268-279. ACM Press, 1985.
[13] Mike Dean. CIA world fact book in OWL, 2003. Available at http://www.daml.org/2003/09/factbook.
[14] Francesco Donini, Maurizio Lenzerini, Daniele Nardi, and Werner Nutt. The complexity of concept languages. Information and Computation, 134:1-58, 1997.
[15] Francesco Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea Schaerf. AL-log: integrating datalog and description logics. Journal of Intelligent Information Systems, 10:227-252, 1998.
[16] François Goasdoué and Marie-Christine Rousset. Answering queries using views: a krdb perspective for the semantic web. To appear in ACM Transactions on Internet Technology.
[17] Volker Haarslev and Ralf Möller. RACER system description. In Proc. of the Int. Joint Conf. on Automated Reasoning, volume 2083 of LNCS, pages 701-705. Springer, 2001.
[18] Ian Horrocks. The FaCT system. In Proc. of the 2nd Int. Conf. on Analytic Tableaux and Related Methods, volume 1397 of LNAI, pages 307-312. Springer, 1998.
[19] Jena - a semantic web framework for java. Available at http://jena.sourceforge.net.
[20] Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite. The Data Warehouse Lifecycle Toolkit. Wiley, 1998.
[21] Alon Levy and Marie-Christine Rousset. Combining horn rules and description logics in CARIN. Artificial Intelligence Journal, 104, 1998.
[22] Zhengxiang Pan and Jeff Heflin. DLDB: Extending relational databases to support semantic web queries. In Workshop on Practical and Scaleable Semantic Web Systems, ISWC 2003, pages 109-113, 2003.
[23] Nigel Pendse. What is OLAP, 2004. Available at http://www.olapreport.com/fasmi.htm.
[24] Marie-Christine Rousset. Backward reasoning in aboxes for query answering. In E. Franconi and M. Kifer, editors, Proc. of the 6th Int. Workshop on Knowledge Representation meets Databases (KRDB 99), volume 21 of CEUR Workshop Proceedings, pages 50-54, 1999.
[25] Kilian Stoffel, Merwyn Taylor, and James Hendler. Efficient management of very large ontologies. In Proc. of the Nat. Conf. on Artificial Intelligence AAAI 1997, 1997.
[26] Aida Vitória, Margarida Mamede, and Luís Monteiro. The retrieval problem in a concept language with number restrictions. In C. Pinto and N.J. Mamede, editors, Proc. of the 7th Portuguese Conf. on Artificial Intelligence, volume 990 of LNAI. Springer, 1995.

Address
Thomas Studer, Institut für Informatik und angewandte Mathematik, Universität Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland, tstuder@iam.unibe.ch