Learning and deduction in neural networks and logic

Size: px
Start display at page:

Download "Learning and deduction in neural networks and logic"

Transcription

1 Learning and deduction in neural networks and logic Ekaterina Komendantskaya Department of Mathematics, University College Cork, Ireland Abstract We demonstrate the interplay between learning and deductive algorithms in logic and neural networks. In particular, we show how learning algorithms recognised in neurocomputing can be used to built connectionist neural networks which simulate the work of semantic operator for logic programs with uncertainty and conventional algorithm of SLD-resolution. Key words: Logic programs, SLD-resolution, Uncertain Reasoning, Connectionism, Neurocomputing 1 Introduction The human mind... infinitely surpasses the powers of any finite machine. This manifest by Kurt Gödel proclaimed in his Gibbs lecture to the American Mathematical Society in 1951, still challenges researchers in many areas of computer science to built a device that will prove to be more powerful than human brain. As advocated in [1], Gödel specifically claimed that Turing overlooked that the human mind in its use, is not static but constantly developing. This particular concern of Gödel found a further development in connectionism and neurocomputing. Connectionism is a movement in the fields of artificial intelligence, cognitive science, neuroscience, psychology and philosophy of mind which hopes to explain human intellectual abilities using artificial neural networks. Neural networks are simplified models of the brain composed of large numbers of units (the analogs of neurons) together with weights that measure the strength of connections between the units. Neural networks demonstrate an ability to address: e.komendantskaya@mars.ucc.ie (Ekaterina Komendantskaya). Preprint submitted to Elsevier 22 October 2006

2 learn such skills as face recognition, reading, and the detection of simple grammatical structure. We will pay special attention to one of the topics of connectionism, namely, the so-called topic of neuro-symbolic integration, which investigates ways of integration of logic and formal languages with neural networks in order to better understand the essence of symbolic (deductive) and human (creative) reasoning, and to show interconnections between them. Neurocomputing is defined in [2] as the technological discipline concerned with information processing systems (neural networks) that autonomously develop operational capabilities in adaptive response to an information environment. Moreover, neurocomputing is often seen as alternative to programmed computing: the latter is based on the notion of some fixed algorithm (a set of rules) which must be performed by a machine; the former does not necessary require an algorithm or rule development. Note that in particular neurocomputing is specialised on creation of machines implementing neural networks, among them are digital, analog, electronic, optical, electro-optic, acoustic, mechanical, chemical and some other types of neurocomputers, see [2] for further details about physical realizations of neural networks. In this paper we will use methods and achievements of both connectionism and neurocomputing: namely, we take the connectionist neural networks of [3] as a starting point, and show the ways of how they can be enriched by learning functions and algorithms implemented in neurocomputing. The main question we address is how learning in neural networks and deduction in logic can complement each other. The main question splits into two particular questions: Is there something in deductive, automated reasoning that corresponds (or may be brought into correspondence) to learning in neural networks? Can first-order deductive mechanisms be realized in neural networks, and if they can, will resulting neural networks be learning or static? First we fix precise meanings of terms deduction and learning. Deductive reasoning is inference in which the conclusion is necessitated by, or deduced from, previously known facts or axioms. Notion of learning is often seen as opposite to the notion of deduction. We adopt the definition of learning from neurocomputing, see [4]. Learning is a process by which the free parameters of a neural network are adapted through a continuing process of simulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place. The definition of learning process implies the following sequence of events: The neural network is stimulated by an environment. The neural network undergoes changes as a result of this simulation. 2

3 The neural network responds in a new way to the environment, because of the changes that have occurred in its internal structure. In this paper we will borrow techniques from both supervised and unsupervised learning paradigms, and in particular we will use such algorithms, as errorcorrection learning (see Section 3.2.1), Hebbian learning (see Section 2.2), competitive learning (see 3.2.3) and filter learning (see Section 3.2.2). The field of neuro-symbolic integration is stimulated by the fact that formal theories (as studied in mathematical logic and used in automated reasoning) are commonly recognised as deductive systems which lack such properties of human reasoning, as adaptation, learning and self-organisation. On the other hand, neural networks, introduced as a mathematical model of neurons in human brains, claim to possess all of the mentioned abilities, and moreover, they provide parallel computations and hence can perform certain calculations faster than classical algorithms. As a step towards integration of the two paradigms, there were built connectionist neural networks [3] which can simulate the work of semantic operator for propositional and (function-free) first-order logic programs. Those neural networks, however, were essentially deductive and could not learn or perform any form of self-organisation or adaptation; they could not even make deduction faster or more effective. There were several attempts to make these neural networks more intelligent, see, for example, [5], [6], [7] for some further developments. In this paper we propose the two ways of how connectionist neural networks of [3] can be brought closer to original ideas of connectionism and neurocomputing. The first way shows that enriched connectionist neural networks of [3] built to simulate the logic programs for reasoning with uncertainty incorporate at least two learning functions and thus posses ability of unsupervised learning, see Section 2 for further details. The second way employs the use of SLD-resolution, and in particular, we build neural networks which can simulate the work of SLD-resolution algorithm, and show that these neural networks have several learning functions, and thus perform different types of supervised and unsupervised learning recognised in neurocomputing. 1.1 Logic Programming and Neural Networks In the following two subsections we briefly overview the background definitions, notions and theorems concerning logic programming, neural networks and neuro-symbolic integration. The notation and the main results we overview in this section will be assumed throughout the paper. 3

4 1.1.1 Logic Programs and their Fixpoint Semantics We fix a first-order language L consisting of constant symbols a 1, a 2,..., variables x 1, x 2,..., function symbols f 1, f 2,..., predicate symbols Q 1, Q 2,..., connectives,, and quantifiers,. We follow the conventional definition of a term and a formula. Namely, constant symbol is a term, variable is a term, and if f i is a function symbol and t is term, then f i (t) is a term. Let Q i be a predicate symbol and t be a term. Then Q i (t) is a formula (also called an atomic formula or an atom). If F 1, F 2 are formulae, than F 1, F 1 F 2, F 1 F 2, xf 1 and xf 1 are formulae. Formula of the form x(a B 1... B n ), where A is an atom and each B i is either an atom or a negation of an atom is called a Horn clause. A Logic Program P is a set of Horn clauses, and it is common to use the notation A B 1,..., B n for Horn clauses of the form x(a B 1... B n ), see [8] for further details. If each B i is positive, then we call the clause definite. The logic program that contains only definite clauses is called a definite logic program. Example 1 This is a simple example of a definite logic program. Q 1 (a 1 ) Q 2 (a 1 ) Q 3 (a 1 ) Q 1 (a 1 ), Q 2 (a 1 ) The Herbrand Universe U P for a logic program P is the set of all ground terms which can be formed out of symbols appearing in P. The Herbrand Base B P for a logic program P is the set of all ground atoms which can be formed by using predicate symbols appearing in P with terms from U P. Let I be an interpretation of P. Then an Herbrand Interpretation I for P is defined as follows: I = {Q i (t 1,..., t n ) B P : Q i (t 1,..., t n ) is true with respect to I}. Semantic operator T P was first introduced in [9], see also [8], as follows: T P (I) = {A B P : A B 1,..., B n is a ground instance of a clause in P and {B 1,..., B n } I}. The semantic operator provides the Least Herbrand Model characterisation for each definite logic program. Define T P 0 = ; T P α = T P (T P (α 1)), if α is a successor ordinal; and T P α = (T P β : β α), if α is a limit ordinal. The following theorem establishes correspondence between the Least Herbrand Model of P and iterations of T P. Theorem 2 [9] Let M P denote the Least Herbrand Model of a definite logic 4

5 program P. Then M P = lfp(t P ) = T P ω. Example 3 The following is the Least Herbrand Model for the logic program P from Example 1: T P 2 = {Q 1 (a 1 ), Q 2 (a 1 ), Q 3 (a 1 )} Connectionist Neural Networks We follow the definitions of a connectionist neural network given in [3], see also [5] and [10] for further developments of the connectionist neural networks. A connectionist network is a directed graph. A unit k in this graph is characterised, at time t, by its input vector (v i1 (t),... v in (t)), its potential p k (t), its threshold Θ k, and its value v k (t). Note that in general, all v i, p i and Θ i, as well as all other parameters of neural network can be performed by different types of data, the most common of which are real numbers, rational numbers [3], fuzzy (real) numbers, complex numbers, numbers with floating point, and some others, see [2] for more details. In the two major sections of this paper we use rational numbers (Section 2) and Gödel (integer) numbers (Section 3). Units are connected via a set of directed and weighted connections. If there is a connection from unit j to unit k, then w kj denotes the weight associated with this connection, and v k (t) = w kj v j (t) is the input received by k from j at time t. The units are updated synchronously. In each update, the potential and value of a unit are computed with respect to an activation and an output function respectively. Most units considered in this paper compute their potential as the weighted sum of their inputs minus their threshold: n k p k (t) = w kj v j (t) Θ k. j=1 The units are updated synchronously, time becomes t + t, and the output value for k, v k (t + t) is calculated from p k (t) by means of a given output function F, that is, v k (t + t) = F (p k (t)). For example, the output function we most often use in this paper is the binary threshold function H, that is, v k (t + t) = H(p k (t)), where H(p k (t)) = 1 if p k (t) > 0 and 0 otherwise. Units of this type are called binary threshold units. Example 4 Consider two units, j and k, having thresholds Θ j, Θ k, potentials p j, p k and values v j, v k. The weight of the connection between units j and k is denoted w kj. Then the following graph shows a simple neural network consisting of j and k. The neural network receives input signals v, v, v and sends an output signal v k. 5

6 v p j w kj p k v Θ j Θ k v j k v k We will mainly consider connectionist networks where the units can be organised in layers. A layer is a vector of units. An n-layer feedforward network F consists of the input layer, n 2 hidden layers, and the output layer, where n 2. Each unit occurring in the i-th layer is connected to each unit occurring in the (i + 1)-st layer, 1 i < n. The next theorem establishes a correspondence between the semantic operator T P defined for definite logic programs and the three-layer connectionist neural networks. Theorem 5 [3] For each definite propositional logic program P, there exists a 3-layer feedforward network computing T P. PROOF. The core of the proof is the translation algorithm from a logic program P into a corresponding neural network, which can be briefly described as follows. The input and output layer are vectors of binary threshold units of length m, where the i-th unit in the input and output layer represents the i-th proposition. All units in the input and output layers have thresholds equal to 0.5. For each clause A B 1,..., B n, do the following. Add a binary threshold unit c to the hidden layer. Connect c to the unit representing A in the output layer with weight 1. For each formula B j, 1 j n, connect the unit representing B j to c with the weight 1. Set the threshold Θ c of c to l 0.5, where l is the number of atoms occurring in B 1,..., B n. For each proposition A, connect unit representing A in the output layer with the unit representing A in the input layer via connection with weight 1. This theorem easily generalises to the case of function-free first-order logic programs, because the Herbrand Base for these logic programs is finite and hence it is possible to work with finite number of ground atoms instead of propositions. However, if we wish to use the construction of Theorem 5 to compute the semantic operator for conventional first-order logic programs, we will need to use some kind of approximation theorems as in [10],[11] or [12] to 6

7 give an account of cases when the Herbrand Base is infinite, and hence infinite number of neurons is needed to simulate the semantic operator for the logic programs. An essential property of the neural networks constructed in the proof of Theorem 5 is that number of iterations of T P (for a given logic program P ) corresponds to the number of iterations of the neural network built upon P. Example 6 Consider the logic program from Example 1. The following neural network computes T P 2 in two iterations. Q 1 (a 1 )Q 2 (a 1 )Q 3 (a 1 ) Q 1 (a 1 )Q 2 (a 1 )Q 3 (a 1 ) There are several immediate conclusions to be made from this introductory section. We have described a procedure of how the Least Herbrand model of a function-free logic program can be computed by a three-layer feedforward connectionist neural network. However, the neural networks do not possess some essential and most beneficial properties of artificial neural networks, such as learning and self-adaptation, ability to perform parallel computations. On the contrary, the connectionist neural networks we have described are essentially deductive procedure, which works no faster than conventional resolution for logic programs. Moreover, the connectionist neural networks depend on ground instances of clauses, and in case of logic programs containing function symbols will require infinitely long layers to compute the least fixed point of T P. This property does not agree with the very idea of neurocomputing, which advocates another principle of computation: effectiveness of both natural and artificial neural networks depends primary on their architecture, which is finite, but allows very sophisticated and well-trained interconnections between neurons. (Note that in human brains the overall number of neurons constantly decreases during the lifetime, but the neurons which remain active are able to increase and optimise interconnections with other neurons, see [4] for more details.) The way of approximation supports the opposite direction of increasing the number of units in the artificial neural network (in order to approximate its infinite counterpart) rather than of improving connections 7

8 within a given neural network. Recall that the original goal of connectionism was to combine the most beneficial properties of automated reasoning and neural networks in order to built a more effective an wiser algorithms. The natural questions is: why the connectionist neural networks of [3] do not possess the essential properties of conventional neural networks and what are the ways to improve them and bring learning, self-organisation and parallelism into their computations? There may be two alternative answers to the question. 1. The connectionist neural networks we have described simulate classical twovalued deductive system, and they inherit purely deductive way of processing a given database from classical logic. If this is the case, we can adapt the algorithm of building a connectionist neural network to non-classical logic programs. And different kinds of logics programs for reasoning with uncertainties and incomplete/inconsistent databases are natural candidates for this. In Section 2 we show that connectionist neural networks computing the Least Herbrand Model for a given bilattice-based annotated logic program will perform two kinds of unsupervised learning - such as Hebbian and Anti-Hebbian learning. These neural networks, however, still require approximation theorems when working with logic programs containing function symbols. This problem led us to yet another answer: 2. It may be the choice of T P operator which led us in the wrong direction. What if we try to simulate classical SLD-resolution by connectionist neural networks? The main benefit of it is that we will not need to work with infinite neural network architectures anymore. And we introduce an algorithm of constructing neural networks simulating the work of SLD-resolution in Section 3. In fact, the neural networks of this type possess many more useful properties than just finite architecture. In particular, we show that these neural networks incorporate six learning functions which are recognised in neurocomputing, and can perform parallel computations for certain types of program goals. We conclude the paper by discussion of benefits of two approaches and possibilities of building neural networks simulating SLD-resolution for logic programs with uncertainties. 8

9 2 Reasoning with Uncertainty and Neural Computations 2.1 Bilattice-based Logic Programs The notion of a bilattice was introduced in 80 s as a generalisation of the famous Belnap s lattice (NONE TRUE BOTH; NONE FALSE BOTH) as a suitable structure for interpreting different languages and programs working with uncertainties and incomplete or inconsistent databases, see [13], [14], [15], [16],[17] for further details and further motivation. Definition 7 [13] A bilattice B is a sextuple (B,,,,, ) such that (B,, ) and (B,, ) are both complete lattices, and : B B is a mapping satisfying the following three properties: 2 = Id B, is a dual lattice homomorphism from (B,, ) to (B,, ), and is the lattice homomorphism from (B,, ) to itself. Note that the lattice (B,, ) is traditionally thought of generalising the Boole lattice (FALSE,TRUE), and describing measures of truth and falsity. The lattice (B,, ) is thought of as measuring amount of information (or knowledge) between NONE and BOTH. We use here the fact that each distributive bilattice can be regarded as a product of two lattices, see [15]. Therefore, we consider only logic programs over distributive bilattices and regard the underlying bilattice of any program as a product of two lattices. Moreover, we always treat each bilattice we work with as isomorphic to some subset of B = L 1 L 2 = ([0, 1], ) ([0, 1], ), where [0, 1] is the unit interval of reals with the linear ordering defined on it. Elements of such a bilattice are pairs: the first element of each pair denotes evidence for a fact, and the second element denotes evidence against it. Thus, (1, 0) and (0, 1) are the analogues of TRUE and FALSE and are maximal respectively minimal in the truth ordering, while (1, 1) (or BOTH) and (0, 0) (or NONE) are respectively maximal and minimal elements in the knowledge ordering. We define an annotated bilattice-based language L to consist of individual variables, constants, functions and predicate symbols as described in Section 1.1.1, together with annotation terms which can consist of variables, constants and/or functions over a bilattice. Bilattice-based languages allow, in general, six connectives and four quantifiers, as follows:,,,,,, Σ, Π,,. But in this paper we restrict our attention to bilattice-based logic programs and will work explicitly only with,, Σ, the latter being existential quantifier with respect to the knowledge ordering. 9

10 An annotated formula is defined inductively as follows: if R is an n-ary predicate symbol, t 1,..., t n are terms, and (µ, ν) is an annotation term, then R(t 1,..., t n ) : (µ, ν) is an annotated formula (called an annotated atom). Annotated atoms can be combined to form complex formulae using the connectives and quantifiers. A bilattice-based annotated logic program (BAP) P consists of a finite set of annotated program clauses of the form A : (µ, ν) L 1 : (µ 1, ν 1 ),..., L n : (µ n, ν n ), where A : (µ, ν) denotes an annotated atom called the head of the clause, and L 1 : (µ 1, ν 1 ),..., L n : (µ n, ν n ) denotes L 1 : (µ 1, ν 1 )... L n : (µ n, ν n ) and is called the body of the clause; each L i : (µ i, ν i ) is an annotated literal called an annotated body literal of the clause. Individual and annotation variables in the body are thought of as being existentially quantified using Σ. In [18], we showed how the remaining connectives,, can be introduced into BAPs. Each annotated atom A : (µ, ν) is interpreted in two steps as follows: the firstorder atomic formula A is interpreted in B (we may also write it as I B (A) B) using the domain of interpretation and variable assignment, see [15], [14], [17], [18] for further details. Then we define the interpretation I B as follows: if I B (A) (µ, ν), we put I B (A : (µ, ν)) = (1, 0); and I B (A : (µ, ν)) = (0, 1) otherwise. Let I B be an interpretation for L and let F be a closed annotated formula of L. Then I B is a model for F if I B (F ) = (1, 0). We say that I B is a model for a set S of annotated formulae if I B is a model for each annotated formula of S. We say that F is a logical consequence of S if, for every interpretation I B of L, I B is a model for S implies I B is a model for F. Let B P and U P denote an annotation Herbrand base respectively Herbrand universe for a program P, which are essentially B P and U P defined in Section 1.1.1, but with annotation terms allowed in U P and attached to ground formulae in B P. In common with conventional logic programming, each Herbrand interpretation HI for P can be identified with the subset {R(t 1,..., t k ) : (α, β) B P R(t 1,..., t k ) : (α, β) receives the value (1, 0) with respect to I B } of B P it determines, where R(t 1,..., t k ) : (α, β) denotes a typical element of B P. This set constitutes an annotation Herbrand model for P. Finally, we let HI P,B denote the set of all annotation Herbrand interpretations for P. It was noticed in [16], [19],[18],[20],[21] that non-linear ordering of (bi)lattices influence both model properties and proof procedures for (bi)lattice-based logics, and this distinguishes them from classical and even fuzzy logic. In particular, both semantic operator and SLD-resolution for BAPs must reflect the non-linear ordering of bilattices, see [18], [21], [22]. 10

11 In [18], we introduced a semantic operator T P for BAPs, proved its continuity and showed that it computes at its least fixed point the least Herbrand model for a given BAP: Definition 8 We define the mapping T P : HI P,B HI P,B as follows: T P (HI) denotes the set of all A : (µ, ν) B P such that either (1) There is a strictly ground instance of a clause A : (µ, ν) L 1 : (µ 1, ν 1 ),..., L n : (µ n, ν n ), such that {L 1 : (µ 1, ν 1),..., L n : (µ n, ν n)} HI for some annotations (µ 1, ν 1),..., (µ n, ν n), and one of the following conditions holds for each (µ i, ν i): (a) (µ i, ν i) k (µ i, ν i ), (b) (µ i, ν i) k j Ji (µ j, ν j ), where J i is the finite set of those indices i, j {1,... n} such that L j = L i or (2) there are annotated strictly ground atoms A : (µ 1, ν 1),..., A : (µ k, ν k) HI such that (µ, ν) k (µ 1, ν 1)... (µ k, ν k). 1 The item 1a is the analogue of T P operator defined in Section Items 1b and 2 reflect properties of the bilattice structure, as further illustrated in the next example. Example 9 Consider a bilattice-based annotated logic program P which can collect and process information about connectivity of some (probabilistic) graph G. Suppose we have received information from two different sources: one reports that there is an edge between nodes a and b, the other - that there is no such. This is represented by the two unit clauses edge(a, b) : (1, 0), edge(a, b) : (0, 1). It will be reasonable enough to conclude that the information is contradictory, that is, to conclude that edge(a, b) : (1, 1). And this is captured in item 2. If, on the other hand, the program contains some clause of the form disconnected(g) : (1, 1) connected(a, c) : (1, 0), connected(a, c) : (0, 1); we may regard the clause disconnected(g) : (1, 1) connected(a, c) : (0, 0) as equally true. And this is captured in item 1b. Let B, A and C denote, respectively, edge(a, b), connected(a, c) and disconnected(g). Consider the logic program: B : (0, 1), B : (1, 0), A : (0, 0) B : (1, 1), C : (1, 1) A : (1, 0), A : (0, 1). The least fixed point of T P is T P 3 = {B : (0, 1), B : (1, 0), B : (1, 1), A : (0, 0), C : (1, 1)}. However, the item 1.a (corresponding to the classical semantic operator) would allow us to compute only T P 1 = {B : (0, 1), B : (1, 0)}, that is, only explicit consequences of a program, which then leads to a contradiction in the twovalued case. 1 Note that whenever F : (µ, ν) HI and (µ, ν ) k (µ, ν), then F : (µ, ν ) HI. Also, for each formula F, F : (0, 0) HI. 11

12 2.2 Neural Networks for Reasoning with Uncertainty We extend the approach of [3] described in section to learning neural networks which can compute logical consequences of BAPs. This will allow us to introduce hypothetical and uncertain reasoning into the framework of neural-symbolic computation. Bilattice-based logic programs can work with conflicting sources of information and inconsistent databases. Therefore, neural networks corresponding to these logic programs should reflect this facility as well, and this is why we introduce some forms of learning into neural networks. These forms of leaning can be seen as corresponding to unsupervised Hebbian learning which is widely implemented in neurocomputing. The general idea behind Hebbian learning is that positively correlated activities of two neurons strengthen the weight of the connection between them and that uncorrelated or negatively correlated activities weaken the weight of the connection (the latter form is known as Anti-Hebbian learning). The general conventional definition of Hebbian learning is given as follows, see [4] for further details. Let k and j denote two units and w kj denote a weight of the connection from j to k. We denote the value of j at time t as v j (t) and the potential of k at time t as p k (t). Then the rate of change in the weight between j and k is expressed in the form w kj (t) = F (v j (t), p k (t)), where F is some function. As a special case of this formula, it is common to write w kj (t) = η(v j (t))(p k (t)), where η is a constant that determines the rate of learning and is positive in case of Hebbian learning and negative in case of Anti-Hebbian learning. Finally, we update w kj (t + 1) = w kj (t) + w kj (t). Example 10 Consider the neural network from Example 4. The Hebbian learning functions described above can be used to train the weight w kj. In this section, we will compare the two learning functions we introduce with this conventional definition of Hebbian learning. First, we prove a theorem establishing a relationship between learning neural networks and BAPs with no function symbols occurring in either individual or annotation terms. (Since the annotation Herbrand base for these programs is finite, they can equivalently be seen as propositional bilattice-based logic programs with no functions allowed in the annotations.) In the next subsection, we will extend the result to first-order BAPs with functions in individual and annotation terms. Theorem 11 For each function-free BAP P, there exists a 3-layer feedforward learning neural network which computes T P. 12

13 PROOF. Let m and n be the number of strictly ground annotated atoms from the annotation Herbrand base B P and the number of clauses occurring in P respectively. Without loss of generality, we may assume that the annotated atoms are ordered. The network associated with P can now be constructed by the following translation algorithm. (1) The input and output layers are vectors of binary threshold units of length k, 1 k m, where the i-th unit in the input and output layers represents the i-th strictly ground annotated atom. The threshold of each unit occurring in the input or output layer is set to 0.5. (2) For each clause of the form A : (α, β) B 1 : (α 1, β 1 ),..., B m : (α m, β m ), m 0, in P do the following. 2.1 Add a binary threshold unit c to the hidden layer. 2.2 Connect c to the unit representing A : (α, β) in the output layer with weight 1. We will call connections of this type 1-connections. 2.3 For each atom B j : (α j, β j ) in the input layer, connect the unit representing B j : (α j, β j ) to c and set the weight to 1. (We will call these connections 1-connections also.) 2.4 Set the threshold θ c of c to l 0.5, where l is the number of atoms in B 1 : (α 1, β 1 ),..., B m : (α m, β m ). 2.5 If some input unit representing B : (α, β) is connected to a hidden unit c, connect each of the input units representing annotated atoms B : (α i, β i ),..., B : (α j, β j ) to c. These connections will be called -connections. The weights of these connections will depend on a learning function. If the function is inactive, set the weight of each -connection to 0. (3) If there are units representing atoms of the form B : (α i, β i ),..., B : (α j, β j ) in input and output layers, correlate them as follows. For each B : (α i, β i ), connect the unit representing B : (α i, β i ) in the input layer to each of the units representing B : (α i, β i ),..., B : (α j, β j ) in the output layer. These connections will be called the -connections. If an -connection is set between two atoms with different annotations, we consider them as being connected via hidden units with thresholds 0. If an -connection is set between input and output units representing the same annotated atom B : (α, β), we set the threshold of the hidden unit connecting them to 0.5, and we will call them -hidden units, so as to distinguish the hidden units of this type. The weights of all these - connections will depend on a learning function. If the function is inactive, set the weight of each -connection to 0. (4) Set all the weights which are not covered by these rules to 0. For each annotated atom A : (α, β), connect unit representing A : (α, β) in the output layer with the unit representing it in the input layer via weight 1. Allow two learning functions to be embedded into the -connections and the -connections. We let v i denote the value of the input unit representing 13

14 B : (α i, β i ) and p c denote the potential of the unit c. Let a unit representing B : (α i, β i ) in the input layer be denoted by i. If i is connected to a hidden unit c via an -connection, then a learning function φ 1 is associated to this connection. We let φ 1 = w ci (t 1) = (v i (t 1))( p c (t 1) + 0.5) become active and change the weight of the -connection from i to c at time t if i became activated at time (t 1), units representing atoms B : (α j, β j ),..., B : (α k, β k ) in the input layer are connected to c via 1- connections and (α i, β i ) k ((α j, β j )... (α k, β k )). Function φ 2 is embedded only into connections of type, namely, into - connections between hidden and output layers. Let o be an output unit representing an annotated atom B : (α i, β i ). Apply φ 2 = w oc (t 2) = (v c (t 2))(p o (t 2) + 1.5) to change w oc at time t if φ 2 is embedded into an - connection from the -hidden unit c to o and there are output units representing annotated atoms B : (α j, β j ),..., B : (α k, β k ), which are connected to the unit o via -connections, these output units became activated at time t 2 and (α i, β i ) k ((α j, β j )... (α k, β k )). Each annotation Herbrand interpretation HI for P can be represented by a binary vector (v 1,..., v m ). Such an interpretation is given as an input to the network by externally activating corresponding units of the input layer at time t 0. It remains to show that A : (α, β) T P n for some n if and only if the unit representing A : (α, β) becomes active at time t+2, for some t. The proof that this is so proceeds by routine induction, see Appendix A. Example 12 The following diagram displays the neural network which computes T P 3 from Example 9. Without functions φ 1, φ 2 the neural network will compute only T P 1 = {B : (0, 1), B : (1, 0)}, explicit logical consequences of the program. But it is the use of φ 1 and φ 2 that allows the neural network to compute T P 3. Note that arrows,, denote respectively 1-connections, -connections and -connections, and we have marked by φ 1, φ 2 the connections which are activated by the learning functions. According to the conventional definition of feedforward neural networks, each output unit denoting some atom is in its turn connected to the input unit denoting the same atom via a 1-connection and this forms a loop. We assume but do not draw these connections here. 14

15 A : (1, 1)A : (1, 0)A : (0, 1)A : (0, 0)B : (1, 1)B : (1, 0)B : (0, 1)B : (0, 0)C : (1, 1)C : (1, 0)C : (0, 1)C : (0, 0) φ 2 φ φ 1 φ A : (1, 1)A : (1, 0)A : (0, 1)A : (0, 0)B : (1, 1)B : (1, 0)B : (0, 1)B : (0, 0)C : (1, 1)C : (1, 0)C : (0, 1)C : (0, 0) We can make several conclusions from the construction of Theorem 11. Neurons representing annotated atoms with identical first-order (or propositional) components are joined into multineurons in which units are correlated using - and -connections. The learning function φ 2 roughly corresponds to Hebbian learning, with the rate of learning η 2 = 1, the learning function φ 1 corresponds to Anti- Hebbian learning with the rate of learning 1, and we regard η 1 as negative because the factor p c in the formula for φ 1 is multiplied by ( 1). The main problem Hebbian learning causes is that the weights of connections with embedded learning functions tend to grow exponentially, which cannot fit into the model of biological neurons. This is why traditionally some functions are introduced to bound the growth. In the neural networks we have built some of the weights may grow with iterations, but the growth will be very slow, because of the activation functions, namely, binary threshold functions, used in the computation of each v i. 2.3 Neural Networks and First-Order BAPs In this section we will briefly describe the essence of the approximation result of [11], but we apply it to BAPs. Since neural networks of [3] were proven to compute the least fixed points of the semantic operator defined for propositional logic programs, many attempts have been made to extend this result to first-order logic programs. See, for example, [10], [11]. We extend here the result obtained by Seda [11] for twovalued first-order logic programs to first-order BAPs. Let l : B P N be a level mapping with the property that, given n N, we can effectively find the set of all A : (µ, ν) B P such that l(a : (µ, ν)) = n. 15

16 The following definition is due to Fitting, and for further explanations see [11] or [18]. Definition 13 Let HI P,B be the set of all interpretations B P B. We define the ultrametric d : HI P,B HI P,B R as follows: if HI 1 = HI 2, we set d(hi 1, HI 2 ) = 0, and if HI 1 HI 2, we set d(hi 1, HI 2 ) = 2 N, where N is such that HI 1 and HI 2 differ on some ground atom of level N and agree on all atoms of level less then N. Fix an interpretation HI from elements of the Herbrand base for a given program P to the set of values {(1, 0), (0, 1)}. We assume further that (1, 0) is encoded by 1 and (0, 1) is encoded by 0. Let HI P denote the set of all such interpretations, and take the semantic operator T P as in Definition 8. Let F denote a 3-layer feedforward learning neural network with m units in the input and output layers. The input-output mapping f F is a mapping f F : HI P,B HI P,B defined as follows. Given HI HI P,B, we present the vector (HI(B 1 : (α 1, β 1 )),..., HI(B m : (α m, β m ))), to the input layer; after propagation through the network, we determine f F (HI) by taking the value of f F (HI)(A j : (α j, β j )) to be the value in the jth unit in the output layer, j = 1,..., m, and taking all other values of f F (HI)(A j : (α j, β j )) to be 0. Suppose that M is a fixed point of T P. Following [11], we say that a family F = {F i : i I} of 3-layer feedforward learning networks F i computes M if there exists HI HI P such that the following holds: given any ε > 0, there is an index i I and a natural number m i such that for all m m i we have d(f m i (HI), M) < ε, where f i denotes f Fi and f m i (HI) denotes the mth iterate of f i applied to HI. Theorem 14 Let P be an arbitrary annotated program, let HI denote the least fixed point of T P and suppose that we are given ε > 0. Then there exists a finite program P = P (ε) (a finite subset of ground(p )) such that d(hi, HI) < ε, where HI denotes the least fixed point of T P. Therefore, the family {F n n N} computes HI, where F n denotes the neural network obtained by applying the algorithm of Theorem 11 to P n, and P n denotes P (ε) with ε taken as 2 n for n = 1, 2, 3,.... This theorem clearly contains two results corresponding to the two separate statements made in it. The first concerns finite approximation of T P, and is a straightforward generalisation of a theorem established in [11]. The second is an immediate consequence of the first conclusion and Theorem 11. Thus, we have shown that the learning neural networks we have built can approximate the least fixed point of the semantic operator defined for first-order BAPs. 16

17 3 SLD-Resolution Performed in Connectionist Neural Networks This section introduces an alternative extension of connectionist neural networks, namely, we adapt ideas of connectionism to SLD-resolution. The resulting neural networks have finite architecture, have learning abilities and can perform parallel computations for certain kinds of program goals. This brings connectionist neural networks closer to artificial neural networks implemented in neurocomputing, see [2]. Furthermore, the fact that classical first-order derivations require the use of learning mechanisms if implemented in neural networks is very interesting on its own right and suggests that firstorder deductive theories are in fact capable of acquiring some new knowledge, at least at the extent of how this process is understood in neurocomputing. In this section we use the notions of first-order language and logic programs defined in Section SLD-Resolution in Logic Programming We briefly survey the notion of unification and SLD-resolution as it was introduced by [23] and [24], see also [8]. Let S be a finite set of atoms. A substitution θ is called a unifier for S if S is a singleton. A unifier θ for S is called a most general unifier (mgu) for S if, for each unifier σ of S, there exists a substitution γ such that σ = θγ. To find the disagreement set D S of S locate the leftmost symbol position at which not all atoms in S have the same symbol and extract from each atom in S the term beginning at that symbol position. The set of all such terms is the disagreement set. Unification algorithm: (1) Put k = 0 and σ 0 = ε. (2) If Sσ k is a singleton, then stop; σ k is an mgu of S. Otherwise, find the disagreement set D k of Sσ k. (3) If there exist a variable v and a term t in D k such that v does not occur in t, then put θ k+1 = θ k {v/t}, increment k and go to 2. Otherwise, stop; S is not unifiable. The Unification Theorem establishes that, for any finite S, if S is unifiable, then the unification algorithm terminates and gives an mgu for S. If S is not unifiable, then the unification algorithm terminates and reports this fact. Definition 15 Let a goal G be A 1,..., A m,..., A k and a clause C be A 17

18 B 1,..., B q. Then G is derived from G and C using mgu θ if the following conditions hold: A m is an atom, called the selected atom, in G. θ is an mgu of A m and A. G is the goal (A 1,..., A m 1, B 1,..., B q, A m+1,..., A k )θ. An SLD-derivation of P {G} consists of a sequence of goals G = G 0, G 1,..., a sequence C 1, C 2,... of variants of program clauses of P and a sequence θ 1, θ 2,... of mgu s such that each G i+1 is derived from G i and C i+1 using θ i+1. An SLD-refutation of P {G} is a finite SLD-derivation of P {G} which has the empty clause as the last goal of derivation. If G n =, we say that refutation has length n. The success set of P is the set of all A B P such P { A} has an SLD-refutation. If θ 1,..., θ n is the sequence of mgus used in SLD-refutation of P {G}, then a computed answer θ for P {G} is obtained by restricting θ 1,..., θ n to variables of G. We say that θ is a correct answer for P {G} if ((G)θ) is a logical consequence of P. The SLD-resolution is sound and complete, which means that the success set of a definite program is equal to its least Herbrand model. The completeness can alternatively be stated as follows. For every correct answer θ for P {G}, there exists a computed answer σ for P {G} and a substitution γ such that θ = σγ. Example 16 Consider the following logic program P 1, which determines, for each pair of numbers x 1 and x 2 whether x 1 x2 is defined. Let Q 1 denote the property to be defined, (f 1 (x 1, x 2 )) denote x 1 x2 ; Q 2, Q 3 and Q 4 denote, respectively, the property of being an even number, nonnegative number and odd number. Q 1 (f 1 (x 1, x 2 )) Q 2 (x 1 ), Q 3 (x 2 ) Q 1 (f 1 (x 1, x 2 )) Q 4 (x 1 ) This program can be enriched with a program which determines whether a given number odd or even, but we will not address this issue here. To keep our example simple, we chose G 0 = Q 1 (f 1 (a 1, a 2 )), where a 1 = 2 and a 2 = 3, and add Q 2 (a 1 ) and Q 3 (a 2 ) to the database. Now the process of SLDrefutation will proceed as follows: (1) G 0 = Q 1 (f 1 (a 1, a 2 )) is unifiable with Q 1 (f 1 (x 1, x 2 )), and the algorithm of unification can be applied as follows: Form S = {Q 1 (f 1 (a 1, a 2 )), Q 1 (f 1 (x 1, x 2 ))} Form D S = {x 1, a 1 }. Put θ 1 = 18

19 x 1 /a 1. Now Sθ 1 = {Q 1 (f 1 (a 1, a 2 )), Q 1 (f 1 (a 1, x 2 ))}. Find D Sθ = {x 2, a 2 } and put θ 2 = x 2 /a 2. Now Sθ 1 θ 2 is a singleton. (2) Form the next goal G 1 = (Q 2 (x 1 ), Q 3 (x 2 ))θ 1 θ 2 = Q 2 (a 1 ), Q 3 (a 2 ). Q 2 (a 1 ) can be unified with the clause Q 2 (a 1 ) and no substitutions are needed. (3) Form the goal G 2 = Q 3 (a 2 ), and it is unified with the clause Q 3 (a 2 ). (4) Form the goal G 3 =. There is a refutation of P 1 G 0, the answer is θ 1 θ 2, and, because the goal G 0 is ground, the correct answer is empty. 3.2 Useful Learning Techniques The following three subsections define the types of learning, which are widely recognised and applied in neurocomputing and which will be used for simulations of SLD-resolution in neural networks later on Error-Correction Learning in Artificial Neural Networks We will use the algorithm of error-correction learning to simulate the process of unification described in the previous section. Error-correction learning is one of the algorithms among the paradigms which advocate supervised learning. The supervised learning is the most popular type of learning implemented in artificial neural networks, and we give a brief sketch of error-correction algorithm in this subsection; see, for example, [4] for further details. Let d k (t) denote some desired response for unit k at time t. Let the corresponding value of the actual response be denoted by v k (t). The response v k (t) is produced by a stimulus (vector) v j (t) applied to the input of the network in which the unit k is embedded. The input vector v k (t) and desired response d k (t) for unit k constitute a particular example presented to the network at time t. It is assumed that this example and all other examples presented to the network are generated by an environment. We define an error signal as the difference between the desired response d k (t) and the actual response v k (t) by e k (t) = d k (t) v k (t). The error-correction learning rule is the adjustment w kj (t) made to the weight w kj at time n and is given by w kj (t) = ηe k (t)v j (t), where η is a positive constant that determines the rate of learning. 19

20 Finally, the formula w kj (t + 1) = w kj (t) + w kj (t) is used to compute the updated value w kj (t + 1) of the weight w kj. We use formulae defining v k and p k as in Section Example 17 The neural network from Example 4 can be transformed into an error-correction learning neural network as follows. We introduce the desired response value d k into the unit k, and the error signal e k computed using d k must be sent to the connection between j and k to adjust w kj. v p j w kj + w kj v Θ j v e k Θ k, d k j w kj k e k, v k Filter Learning and Grossberg s Law Filter learning, similar to the Hebbian learning, is a form of unsupervised learning, see [2] for further details. Filter learning and in particular Grossberg s law will be used in simulations of SLD-resolution at stages when a network, acting in accordance with conventional SLD-resolution, must choose, for each atom in the goal, with which particular clause in the program it will be unified at the next step. Consider the situation when a unit receives multiple input signals, v 1, v 2,..., v n, with v n distinguished signal. In Grossberg s original neurobiological model [25], the v i, i n, were thought of as conditioned stimuli and the signal v n was an unconditioned stimulus. Grossberg assumed that v i, i n was 0 most of the time and took large positive value when it became active. Choose some unit c with incoming signals v 1, v 2,..., v n. Grossberg s law is expressed by the equation w new ci = w old ci + a[v i v n w old ci ]U(v i ), (i {1,..., n 1}), where 0 a 1 and where U(v i ) = 1 if v i > 0 and U(v i ) = 0 otherwise. Example 18 Consider the unit j from Example 4. It receives multiple input values, and this means that we can apply Grossberg s learning law to the weights connecting v, v and v with j. In the next section we will also use the inverse form of Ginsberg s law and apply the equation w new ic = w ic (t) old + a[v i v n w old ic ]U(v i ), (i {1,..., n 1}) to enable (unsupervised) change of weights of connections going from some 20

21 unit c which sends outcoming signals v 1, v 2,... v n to units 1,..., n respectively. This will enable outcoming signals of one unit to compete with each other Kohonen s Layer and Competitive Learning The definition of a Kohonen Layer was introduced in conjunction with the famous Kohonen learning law [26], and both notions are usually thought of as fine examples of competitive learning. We are not going to use the latter notion in this paper and will concentrate on the notion of Kohonen layer, which is defined as follows. The layer consists of N units, each receiving n input signals v 1,..., v n from another layer of units. The v j input signal to Kohonen unit i has a weight w ij assigned to it. We denote by w i the vector of weights (w i1,..., w in ), and we use v to denote the vector of input signals (v 1,..., v n ). Each Kohonen unit calculates its input intensity I i in accordance with the following formula: I i = D(w i, v), where D(w i, v) is the distance measurement function. The common choice for D(w i, v) is the Euclidian distance D(w i, v) = w i v. Once each Kohonen unit has calculated its input intensity I i, a competition takes place to see which unit has the smallest input intensity. Once the winning Kohonen unit is determined, its output v i is set to 1. All the other Kohonen unit output signals are set to 0. Example 19 The neural network below consists of three units organised in one layer; the competition according to the algorithm described above will take place between those three units, and only one of them will eventually emit the signal. w 1 w 2 w 3 v 1 v 2 v SLD-resolution in neural networks In order to perform SLD-resolution in neural networks, we will allow not only binary threshold units in the connectionist neural networks, but also units which may receive and send Gödel numbers as signals. 21

22 We will use the fact that every first-order language yields a Gödel enumeration. There are several ways of performing the enumeration, we just fix one as follows. Each symbol of the first-order language receives a Gödel number as follows: variables x 1, x 2, x 3,... receive numbers (01), (011), (0111),...; constants a 1, a 2, a 3,... receive numbers (21), (211), (2111),...; function symbols f 1, f 2, f 3,... receive numbers (31), (311), (3111),...; predicate symbols Q 1, Q 2, Q 3,... receive numbers (41), (411), (4111),...; symbols (, ) and, receive numbers 5, 6 and 7 respectively. It is possible to enumerate connectives and quantifiers, but we will not need them here and so omit further enumeration. Example 20 The following is the enumeration of atoms from Example 16, the rightmost column contains short abbreviations we use for these numbers in further examples: Atom Gödel Number Label Q 1 (f 1 (x 1, x 2 )) g 1 Q 2 (x 1 ) g 2 Q 3 (x 2 ) g 3 Q 3 (a 2 ) g 4 Q 2 (a 1 ) g 5 Q 1 (f 1 (a 1, a 2 )) g 6 Disagreement set can be defined as follows. Let g 1, g 2 be Gödel numbers of two arbitrary atoms A 1 and A 2 respectively. Define the set g 1 g 2 as follows. Locate the leftmost symbols j g1 and j g2 in g 1 and g 2 which are not equal. If j gi, i {1, 2} is 0, put 0 and all successor symbols 1,..., 1 into g 1 g 2. If j gi is 2, put 2 and all successor symbols 1,..., 1 into g 1 g 2. If j gi is 3, then extract first two symbols after j gi and then go on extracting successor symbols until number of occurrences of symbol 6 becomes equal to the number of occurrences of symbol 5, put the number starting with j gi and ending with the last such 6 in g 1 g 2. It is a straightforward observation that g 1 g 2 is equivalent to the notion of the disagreement set D S, for S = {A 1, A 2 } as it is defined in Section 3.1. We will also need the operation, concatenation of Gödel numbers, defined by g 1 g 2 = g 1 8g 2. Let g 1 and g 2 denote Gödel numbers of a variable x i and a term t respectively. 22

First-order deduction in neural networks

First-order deduction in neural networks First-order deduction in neural networks Ekaterina Komendantskaya 1 Department of Mathematics, University College Cork, Cork, Ireland e.komendantskaya@mars.ucc.ie Abstract. We show how the algorithm of

More information

The non-logical symbols determine a specific F OL language and consists of the following sets. Σ = {Σ n } n<ω

The non-logical symbols determine a specific F OL language and consists of the following sets. Σ = {Σ n } n<ω 1 Preliminaries In this chapter we first give a summary of the basic notations, terminology and results which will be used in this thesis. The treatment here is reduced to a list of definitions. For the

More information

Introduction to Metalogic

Introduction to Metalogic Philosophy 135 Spring 2008 Tony Martin Introduction to Metalogic 1 The semantics of sentential logic. The language L of sentential logic. Symbols of L: Remarks: (i) sentence letters p 0, p 1, p 2,... (ii)

More information

Duality in Logic Programming

Duality in Logic Programming Syracuse University SURFACE Electrical Engineering and Computer Science Technical Reports College of Engineering and Computer Science 3-1991 Duality in Logic Programming Feng Yang Follow this and additional

More information

What are the recursion theoretic properties of a set of axioms? Understanding a paper by William Craig Armando B. Matos

What are the recursion theoretic properties of a set of axioms? Understanding a paper by William Craig Armando B. Matos What are the recursion theoretic properties of a set of axioms? Understanding a paper by William Craig Armando B. Matos armandobcm@yahoo.com February 5, 2014 Abstract This note is for personal use. It

More information

Warm-Up Problem. Is the following true or false? 1/35
