arxiv: v1 [cs.ai] 24 Mar 2016

Size: px

Start display at page:

Download "arxiv: v1 [cs.ai] 24 Mar 2016"

Jane Doyle
5 years ago
Views:

1 arxiv: v1 [cs.ai] 24 Mar 216 Probabilistic Reasoning via Deep Learning: Neural Association Models Quan Liu and Hui Jiang and Zhen-Hua Ling and Si ei and Yu Hu National Engineering Laboratory or Speech and Language Inormation Processing University o Science and Technology o China, Heei, Anhui, China Department o Electrical Engineering and Computer Science, York University, Canada iflytek Research, Heei, China s: quanliu@mail.ustc.edu.cn, hj@cse.yorku.ca, zhling@ustc.edu.cn siwei@ilytek.com, yuhu@ilytek.com Abstract In this paper, we propose a new deep learning approach, called neural association model (NAM), or probabilistic reasoning in artiicial intelligence. e propose to use neural networks to model association between any two events in a domain. Neural networks take one event as input and compute a conditional probability o the other event to model how likely these two events are associated. The actual meaning o the conditional probabilities varies between applications and depends on how the models are trained. In this work, as two case studies, we have investigated two NAM structures, namely deep neural networks (DNNs) and relation-modulated neural nets (s), on several probabilistic reasoning tasks in AI, including recognizing textual entailment, triple classiication in multi-relational knowledge bases and commonsense reasoning. Experimental results on several popular data sets derived rom ordnet, Free- ase and ConceptNet have all demonstrated that both DNNs and s perorm equally well and they can signiicantly outperorm the conventional methods available or these reasoning tasks. Moreover, comparing with DNNs, s are superior in knowledge transer, where a pre-trained model can be quickly extended to an unseen relation ater observing only a ew training samples. 1 Introduction Reasoning is an important topic in artiicial intelligence (AI), which has attracted considerable attention and research eort in the past ew decades [McCarthy, 1986; Minsky, 1988; Mueller, 214]. esides the traditional logic reasoning, probabilistic reasoning has been studied as another typical genre in order to handle knowledge uncertainty in reasoning based on the probability theory [Pearl, 1988; Neapolitan, 212]. The probabilistic reasoning aims to predict the conditional probability Pr(E 2 E 1 ) o one event E 2 given another event E 1. State-o-the-art methods or probability reasoning include ayesian networks [Jensen, 1996], Markov logic networks [Richardson and Domingos, 26] and other graphical models [Koller and Friedman, 29]. Taking ayesian networks as example, the conditional probabilities between two associated events are calculated as posterior probabilities according to the ayes theorem, where all possible events are modeled by a pre-deined graph structure. However, these methods quickly become intractable or most practical tasks where the number o all possible events is large. In recent years, distributed representations that map discrete language units into continuous vector s have gained signiicant popularity along with the development o neural networks [engio et al., 23; Collobert et al., 211; Mikolov et al., 213]. The main beneit or embedding in continuous is its smooth property to capture the semantic relatedness between discrete events, potentially generalizable to unseen events. Similar ideas, such as knowledge graph embedding, have been proposed to represent knowledge bases (K) in low-dimensional continuous [ordes et al., 213; Socher et al., 213; ang et al., 214; Nickel et al., 215]. Using the smoothed K representation, it is possible to reason over the relations among various entities. However, human-like reasoning remains as an extremely challenging problem because it requires to eectively encode world knowledge using powerul models. Most o the existing Ks are quite sparse and can only capture a raction o world knowledge, even or those recently created large-scale Ks, such as YAGO, NELL and Freebase. In order to take advantage o these sparse knowledge bases, the state-o-theart approaches or knowledge graph embedding usually adopt simple linear models, such as RESCAL [Nickel et al., 212] or statistical relational learning (SRL). Motivated by the recent successes o deep learning in pattern recognition, in this paper, we propose to use deep neural networks, called neural association model (NAM), to represent the sparse knowledge bases. Dierent rom the existing linear models, the proposed NAM model uses multi-layer nonlinear activations in deep neural nets to model the association conditional probabilities between any two possible events. In the proposed NAM ramework, all symbolic events are represented in low-dimensional continuous and we have no need to explicitly speciy any dependency structure among events as required in ayesian networks. Deep neural networks are used to model the association between any two events, taking one event as input to compute a conditional probability o another event. The computed conditional

2 tor probability or association may be generalized to model various reasoning problems, such as entailment inerence, relational learning, causation modelling and so on. In this work, we study two model structures or NAM. The irst model is a standard deep neural networks (DNN) and the second model uses a special structure called relation modulated neural nets (). Experiments on several probabilistic reasoning tasks, including recognizing textual entailment, triple classiication in multi-relational Ks and commonsense reasoning, have demonstrated that both DNN and can signiicantly outperorm other conventional methods. Moreover, the model is shown to be eective in knowledge transer learning, where a pre-trained model can be quickly extended to a new relation ater observing only a ew training samples. The main contributions o this paper can be summarized as ollows: e propose neural association models or probabilistic reasoning based on deep neural networks, which (L+1) are general enough to deal with various reasoning problems Score unction over symbolic events. One speciic model structure studied (L) in this work (L) is quiteout: eective z (L) or knowledge transer learning, which In: a can quickly (L) adapt an existing knowledge base to newly encountered scenarios and contexts. (L) To the best o our knowledge, this is the irst work to demonstrate out: z (2) that DNNs with the multi-layer nonlinearity are useul or probabilistic reasoning. 2 A Review on Statistical Relation Learning This paperout: mainly z (1) ocuses on probabilistic reasoning on multi-relational In: a (1) data. e irst review the ormulation o statistical relational learning (SRL) as in [Nickel et al., 215]. Let (1) Existed Relations Head entity vector E = {e 1,..., e Ne } be the set o all entities and R = {r 1,..., r Nr } be the set o all relation types in K, where N e is the total entity number and N r the relation number. SRL Head entity attempts vector to model each possible triple x ijk = (e i, r k, e j ) over the sets o entities and relations as a binary random variable y ijk {, 1} that indicates true or alse. All possible triples in E R E may be ormed as a third-order tensor Y {, 1} Ne Ne Nr. SRL aims to estimate the joint distribution o Y, Pr(Y), rom a training set D o all observed triples. Most SRL models assume that all y ijk are conditionally independent given the model parameters Θ. As a result, relying on a score unction (x ijk ; Θ) or each triple x ijk = (e i, r k, e j ), the likelihood unction o an SRL model may be derived as ollows: N e N e P(Y D, Θ) = N r i=1 j=1 k=1 er (y ijk σ ((x ijk ; Θ))) (1) where σ( ) denotes the sigmoid unction and er (y p) the ernoulli distribution as: { p, i y = 1 er (y p) = (2) 1 p, i y = Under the above ormulation, the key work is to deine the score unction or each triple. In [Nickel et al., 212], (2) (1) (1) they have proposed RESCAL to explain triples via pairwise interactions o latent eatures. In [Sutskever et al., 29; Jenatton et al., 212], a latent actor model is designed to capture the second-order correlation between entities using a quadratic orm. The work in [Lin et al., 215] attempts to learn the latent embeddings by projecting all entities rom the entity to the corresponding relation and then building translations among the projected entities. In [ordes et al., 212], they have proposed a model based on semantic matching energy to capture the correlation between entities and relations via multiple matrix products and Hadamard products. In [Socher et al., 213], neural tensor model (NTM) is constructed using a very expressive score unction with a large number o parameters. TransE presented in [ordes et al., 213] deines a linear unction or modeling triples in a joint o entities and relations. Finally, [ang et al., 214] propose to construct the score unction by projecting entities into a relation-speciic hyperplane. Obviously, almost all existing methods in the literature (L+1) adopt linear models or statistical relation learning. 3 Neural Association Models (L) (NAM) In this paper, we propose to use a nonlinear model, namely deep neural nets, or probabilistic reasoning. Our main goal is to use neural nets to model the association (2) probability (2) or any two events E 1 and E 2 in a domain, i.e., Pr(E 2 E 1 ) o E 2 conditioning on E 1. All possible events in the domain are projected into continuous (1) without speciying (1) any explicit dependency structure among them. In the ollowing, we irst introduce neural association models (NAM) as a general modeling ramework or probabilistic reasoning in section 3.1. Next, we describe two particular NAM Tail entity model vector structuresnew orrelation multi-relational Head data entity invector sections 3.2 and NAM in general Event E 1 Transering Deep Neural Networks Association in DNNs P(E 2 E 1 ) Figure 1: The NAM ramework in general. (L) Event E 2 Figure 1 shows the general ramework o NAM or associating two events, E 1 and E 2. In the general NAM ramework, the events are irst projected into a low-dimension continuous. Deep neural networks with multi-layer nonlinearity are used to model how likely these two events are associated. Neural networks take the embedding o one event E 1 (antecedent) as input and computes a conditional probability Pr(E 2 E 1 ) o the other event E 2 (consequent). I the event E 2 is binary (true or alse), the NAM models may use a sigmoid node to compute Pr(E 2 E 1 ). I E 2 takes multiple

3 Existed Relations (L+1) (L) (2) (1) mutually exclusive values, we use a ew sotmax nodes or Pr(E 2 E 1 ), where it may need to use multiple embeddings or E 2 (one per value). NAMs do not explicitly speciy how dierent events E 2 are actually related, they may be mutually exclusive, contained, intersected. NAMs are only used to separately compute conditional probabilities, Pr(E 2 E 1 ), or each pair o events, E 1 and E 2, in a task. The actual physical meaning o the conditional probabilities Pr(E 2 E 1 ) varies between applications and depends on how the models are trained. Table 1 lists a ew possible applications or NAMs. Application E 1 E 2 language modeling h w causal reasoning cause eect knowledge triple classiication {e i, r k } e j lexical entailment 1 2 textual entailment D 1 D 2 Table 1: Some applications or NAMs. In language modeling, the antecedent event is the representation o history context, h, and the consequent event is next word w that takes one out o K values. In causal reasoning, E 1 and E 2 represent cause and eect respectively. For example, we have E 1 = eating cheesy cakes and E 2 = getting happy, Pr(E 2 E 1 ) means that how likely E 1 may cause the binary event E 2 (true or alse). In the same model, we may add more nodes to model dierent eects rom the same E 1, e.g., E 2 = growing at. Moreover, we (L+1) may add 5 sotmax nodes to model a multi-valued event, e.g., E 2 = happiness (scale rom (L) 1 to 5). Similarly, or knowledge triple classiication o multi-relationtransering data, given one triple (e i, r k, e j ), E 1 (L) (L) consists o the head entity (subject) e i and relation (predicate) r (2) k, and E 2 is a binary event indicating (2) whether (2) the tail entity (object) e j is true or alse. Finally, the applications o recognizing (1) lexical or textual entailment, (1) E 1 and E 2 may be (1) deined as premise and hypothesis. More generally, NAMs can be used to model an ininite number o events E 2, where each point in a continuous represents a possible event. Head entity vector New Relation Head entity vector In this work, or simplicity, we only consider NAMs or a inite number o binary events E 2 but the ormulation can be easily extended to more general cases. Comparing with traditional Deep methods, Neural Networkslike ayesian networks, NAMs employ neural nets as a universal approximator to directly model individual pair-wise event association Event E1 Event E2 probabilities without relying on explicit dependency structure among them. Thereore, NAMs can be end-to-end learned Association in DNNs purely rom training samples without P(E2 E1) strong human prior knowledge, potentially more scalable to real-world tasks. Learning o NAMs Assume we have a set o N d observed examples (event pairs {E 1, E 2 }), D, each o which is denoted as x n. This training set normally includes both positive and negative samples. e denote all positive samples (E 2 = true) as D + and all negative samples (E 2 = alse) as D. Under the same independence assumption in SRL, the log likelihood unction o a NAM model can be expressed as ollows: L(Θ) = ln (x + n ; Θ) + ln(1 (x n ; Θ)) x + n D + x n D (3) where (x n ; Θ) denotes a logistic score unction derived by the NAM or each x n, which numerically computes the conditional probability Pr(E 2 E 1 ). More details on ( ) are given later. Stochastic gradient descent (SGD) methods may be used to maximize the above likelihood unction, leading to maximum likelihood estimation (MLE) or NAMs. In the ollowing, as two case studies, we consider two NAM structures with a inite number o output nodes to model Pr(E 2 E 1 ) or any pair o events, where we have only a inite number o E 2 and each E 2 is binary. The irst model is a typical DNN that associates antecedent event (E 1 ) at input and consequent event (E 2 ) at output. Moreover, we present another model structure, called relation-modulated neural nets, which is more suitable or multi-relational data. 3.2 DNN or NAMs The irst NAM structure is a traditional DNN as shown in Figure 2. Here we use multi-relational data in K or illustration. Given a K triple x n = (e i, r k, e j ) and its corresponding label y n (true or alse), we cast E 1 = (e i, r k ) and E 2 = e j to compute Pr(E 2 E 1 ) as ollows. Relation vector Association at here (1) (L) out: z (L) In: a (L) out: z (2) out: z (1) In: a (1) Head entity vector Figure 2: The DNN structure or NAMs. Firstly, we represent head entity phrase e i and tail entity phrase e j by two embedding vectors v (1) i ( V (1) ) and v (2) j ( V (2) ). Similarly, relation r k is also represented by a low-dimensional vector c k C, which we call it relation code hereater. Secondly, (L+1) as shown Association in Figure at here 2, we combine the embeddings o the head entity e i and the relation r k to out: z eed into an (L + 1)-layer DNN as input. (L) In: The a DNN consists (L) o L rectiied linear (ReLU) (L) hidden layers [Nair (L) and Hinton, 21]. The input to DNN is written as z () = [v (1), c k ]. During the eedorward process, (2) we have out: z (2) out: z (1) In: a (1) a (l) = (l) z (l 1) + b l (l = 1,, L) (4) (1) ( (1) Relation z (l) = vector h a (l)) ( = max, a (l)) Head (l entity = 1, vector, L) (5) where (l) and b l represent the weight matrix and bias or layer l respectively. i

4 (L) (2) (1) (head) ector Event E2 Finally, we propose Association toat calculate here a sigmoid score or each triple x n = (e i, r k, e j ) as the association probability using out: z (L) the last hidden layer s output and the tail entity In: a vector v (2) (L) j : ( (L) ) (x n ; Θ) = σ z (L) v (2) out: z (2) (6) where σ( ) is the sigmoid unction, i.e., σ(x) out: z = 1/(1 + e x (1) ). In: a All network parameters o this NAM structure, (1) represented as Θ = {, V (1), V (2), C}, may (1) be jointly learned by maximizing the likelihood unction in eq. Relation vector (3). V (head) 3.3 Relation-modulated Neural Networks () Particularly or multi-relation data, ollowing the idea in [Xue et al., 214], we propose to use the so-called relationmodulated neural nets (s), as shown in Figure 3. Relation vector (L+1) (L) (2) (1) j Association at here (1) (L) out: z (L) In: a (L) out: z (2) out: z (1) In: a (1) Head entity vector Head entity vector Figure 3: The relation-modulated neural networks (). The uses the same operations as DNNs to project all entities and relations into low-dimensional continuous s. As shown in Figure 3, we connect the knowledgespeciic relation code c (k) to all hidden layers in the network. As shown later, this structure is superior in knowledge transer learning tasks. Thereore, or each layer o s, instead o using eq.(4), its linear activation signal is computed rom the previous layer z (l 1) and the relation code c (k) as ollows: a (l) = (l) z (l 1) + (l) c (k), (l = 1 L) (7) where (l) and l represent the normal weight matrix and the relation-speciic weight matrix or layer l. At the topmost layer, we propose to calculate the inal score or each triple x n = (e i, r k, e j ) using the relation code as: ( (x n ; Θ) = σ z (L) v (2) j + (L+1) c (k)). (8) In the same way, all parameters, including Θ = {,, V (1), V (2), C}, can be jointly learned based on the above maximum likelihood estimation. The models are particularly suitable or knowledge transer learning, where a pre-trained model can be quickly extended to any new relation ater observing a ew samples rom that relation. In this case, we may estimate a new relation code based on the available new samples while keeping the whole network unchanged. Due to its small size, the new relation code can be reliably estimated rom only a small number o new samples. Furthermore, model perormance in all original relations will not be aected since the model and all original relation codes are not changed in the transer learning. 4 Experiments In this section, we evaluate the proposed NAM models or various reasoning tasks. e irst describe the experimental setup and then we will report several reasoning tasks examined in this work, including textual entailment recognition, triple classiication in multi-relational Ks, commonsense reasoning and knowledge transer learning. 4.1 Experimental setup Here we irst introduce some common experimental settings used or all experiments: (1) For entity or sentence representations, we represent them by composing rom its word vectors as in [Socher et al., 213]. All word vectors are initialized rom a pre-trained skip-gram [Mikolov et al., 213] word embedding model, trained on a large English ikipedia corpus 1. The dimension or all word embeddings are set to 1 or all experiments. (2) The dimension o all relation codes are all set to 5. All relation codes are randomly initialized. (3) For network structures, we use ReLU as the nonlinear activation unction and all network parameters are initialized according to [Glorot and engio, 21]. Meanwhile, since the number o training examples or most probabilistic reasoning tasks is relative small, we adopt the dropout approach [Hinton et al., 212] during the training process to avoid the over-itting problem. (4) During the learning process o NAMs, we need to use negative samples, which are automatically generated by randomly perturbing positive K triples as D = {(e i, r k, e l ) e l e j (e i, r k, e j ) D + }. For each task, we use the provided development set to tune or the best training hyperparameters. For example, we have tested the number o hidden layers among {1, 2, 3}, the initial learning rate among {.1,.5,.1,.25,.5}, dropout rate among {,.1,.2,.3,.4}. Finally, we select the best setting based on the perormances on the development set: the inal model structure uses 2 hidden layers, and the learning rate and the dropout rate are set to.1 and.2 or all the experiments, respectively. During model training, the learning rate is halved once the perormances in the development set decreases. oth DNNs and s are trained using the stochastic gradient descend (SGD) algorithm. 2 e notice that the NAM models converge ast ater 3 epochs o learning. 4.2 Recognizing Textual Entailment Understanding entailment and contradiction is undamental to understanding natural language. Here we conduct experiments on a popular recognizing textual entailment (RTE) task, which aims to recognize the entailment relationship between a pair o English sentences. In this experiment, we 1 A small number o words do not have pre-trained vectors in the skip-gram model. e just initialize them randomly. 2 All Codes or NAMs will be made available in the uture.

5 use the SNLI dataset in [owman et al., 215] to conduct 2-class RTE experiments (entailment or contradiction). The SNLI dataset contains hundreds o thousands training examples, which is used to train a NAM model. Since this data set is not or multi-relational data, we only investigate the DNN structure or this task. The inal NAM result, along with the baseline perormance provided in [owman et al., 215], is listed in Table 2. Model Accuracy (%) Edit Distance [owman et al., 215] 71.9 Classiier [owman et al., 215] 72.2 Lexical Resources [owman et al., 215] 75. DNN 84.7 Table 2: Experimental results on the RTE task. From the results, we can see the proposed DNN based NAM model achieves considerable improvements over various traditional methods. This indicates that we can better model entailment relationship in natural language by representing sentences in continuous and conducting probabilistic reasoning with deep neural networks. 4.3 Triple classiication in multi-relational Ks In this section, we evaluate the proposed NAM models on two popular knowledge triple classiication datasets, namely N11 and F13 in [Socher et al., 213] (derived rom ord- Net and Freease), to predict whether some new triple relations hold based on other training acts in the database. The N11 dataset contains 38,696 unique entities involving 11 dierent relations in total while the F13 dataset covers 13 relations and 75,43 entities. Table 3 summarizes the statistics o these two data sets. Dataset # R # Ent # Train # Dev # Test N , ,581 2,69 1,544 F ,43 316,232 5,98 23,733 Table 3: The statistics or Ks triple classiication datasets. #R is the number o relations. #Ent is the size o entity set. The goal o knowledge triple classiication is to predict whether a given triple x n = (e i, r k, e j ) is correct or not. e irst use the training data to learn NAM models. Aterwards, we use the development set to tune a global threshold T to make a binary decision: the triple is classiied as true i (x n ; Θ) T ; otherwise it is alse. The inal accuracy is calculated based on how many triplets in the test set are classiied correctly. Experimental results on both N11 and F13 datasets are given in Table 4, where we compare the two NAM models with all other methods reported on these two data sets. The results clearly show that both NAM methods (DNN and ) achieve comparable perormance on these triple classiication tasks, and both yield consistent improvement over all existing methods. Particularly, the model gives 3.7% and 1.9% absolute improvements over the popular neural tensor networks (NTN) [Socher et al., 213] on N11 and F13 respectively. Model N11 F13 Avg. SME [ordes et al., 212] TransE [ordes et al., 213] TransH [ang et al., 214] TransR [Lin et al., 215] NTN [Socher et al., 213] DNN Table 4: Triple classiication accuracy o various methods in N11 and F13. Note that both DNN and models are much smaller than NTN in number o parameters and they scale well as the number o relation types increases. For example, either DNN or or N11 has about 7.8 millions o parameters while NTN has about 15 millions. Although RESCAL or TransE has about 4 millions o parameters or N11, their size goes up quickly or other tasks o thousands or more relation types. In addition, the training time o DNN and is much shorter than that o NTN or TransE since our models converge much aster. For example, we have obtained at least 5 times speedup over NTN in N Commonsense Reasoning Similar to the triple classiication task [Socher et al., 213], in this work, we use the ConceptNet K [Liu and Singh, 24] to construct a new commonsense data set, named as CN14 hereater 3. hen building CN14, we irst select all acts in ConceptNet related to 14 typical commonsense relations, e.g., UsedFor, CapableO. (see Figure 4 or all 14 relations.) Then, we randomly divide the extracted acts into three sets, Train, Dev and Test. Finally, in order to create a test set or classiication, we randomly switch entities (in the whole vocabulary) rom correct triples and result in a total o 2 #Test triples (hal is positive samples and hal is negative examples). The statistics o CN14 are given in Table 5. Dataset # R # Ent. # Train # Dev # Test CN ,135 2,198 5, 1, Table 5: The statistics or the CN14 dataset. The CN14 dataset is designed or answering commonsense questions like Is a camel capable o journeying across desert? The proposed NAM models answer this question by calculating the association probability Pr(E 2 E 1 ) where E 1 = {camel, capable o} and E 2 = journey across desert. In this paper, we compare two NAM methods with the popular NTN method in [Socher et al., 213] on this data set and the overall results are given in Table 6. e can see that both NAM methods signiicantly outperorm NTN in this task, and the DNN and models obtain similar perormance. Furthermore, we show the classiication accuracy o all 14 relations in CN14 or RNMM and NTN in Figure 4, which 3 e will publicly release this data set soon under the ConceptNet licence.

6 Model Positive Negative total NTN DNN Table 6: Accuracy (in %) comparison on CN14. SymbolO DesireO Createdy HasLastSubevent Desires CausesDesire ReceivesAction MotivatedyGoal Causes HasProperty HasPrerequisite HasSubevent CapableO UsedFor NTN Figure 4: Accuracy o dierent relations in CN14. show that the accuracy o varies among dierent relations rom 8.1% (Desires) to 93.5% (Createdy). e notice some commonsense relations (such as Desires, CapableO, HasSubevent) are harder than the others (like Createdy, CausesDesire, MotivatedyGoal). In general, RNMM signiicantly overtakes NTN in almost all relations. 4.5 Knowledge Transer Learning DNN 5% 1% 15% 2% 25% 5% 75% 1% Figure 5: Accuracy (in %) on the test set o a new relation CausesDesire is shown as a unction o used training samples rom CausesDesire when updating the relation code only. (Accuracy on the original relations remains as 85.6%.) Knowledge transer between various domains is a characteristic eature and crucial cornerstone o human learning. In this section, we evaluate the proposed NAM models or a knowledge transer learning scenario, where we adapt a pretrained model to an unseen relation with only a ew training samples rom the new relation. Here we randomly select a relation, e.g., CausesDesire in CN14 or this experiment. This relation contains only 48 training samples and 48 test samples. During the experiments, we use all other 13 relations in CN14 to train baseline NAM models (both DNN and ). During the transer learning, we reeze all NAM parameters, including all weights and entity representations, and only learn a new relation code or CausesDesire rom the given samples. At last, the learned relation code (along with the original NAM models) is used to classiy the new samples o CausesDesire in the test set. Obviously, this transer learning does not aect the model perormance in the original 13 relations because the models are not changed. Figure 5 show the results o knowledge transer learning or the relation CausesDesire as we increase the training samples gradually. The result shows that RNMM perorms much better than DNN in this experiment, where we can signiicantly improve RNMM or the new relation with only 5-2% o the total training samples or CausesDesire. This demonstrates that the structure to connect the relation code to all hidden layers leads to more eective learning o new relation codes rom a relatively small number o training samples. Next, we also experiment a more aggressive learning strategy or this transer learning setting, where we simultaneously update all the network parameters during the learning o the new relation code. The results are shown in Figure 6. This strategy can deinitely improve perormance more on the new relation, especially when we add more training samples. However, as expected, the perormance on the original 13 relations will deteriorate. The DNN may improve the perormance on the new relation as we use all training samples (up to 94.6%). However, the perormance on the remain 13 original relations drops dramatically rom 85.6% to 75.5%. Once again, shows the advantage over DNN in this transer learning setting, where the accuracy on the new relation increases rom 77.9% to 9.8% but the accuracy on the original 13 relations only drop slightly rom 85.9% to 82.% DNN 5% 1% 15% 2% 25% 5% 75% 1% DNN 5% 1% 15% 2% 25% 5% 75% 1% Figure 6: Transer learning results by updating all network parameters. The let igure shows results on the new relation while the right igure shows results on the original relations. 5 Conclusions In this paper, we have proposed neural association models (NAM) or probabilistic reasoning. e use neural networks to model association probabilities between any two events in a domain. In this work, we have investigated two model structures, namely DNN and, or NAMs. Experimental results on several reasoning tasks have shown that both DNNs and s can signiicantly outperorm the existing methods. This paper also reports some preliminary results to use NAMs or knowledge transer learning. e have ound that the proposed model can be quickly adapted to a new relation without sacriicing the perormance in the original relations.

7 Reerences [engio et al., 23] Yoshua engio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal o Machine Learning Research, 3: , 23. [ordes et al., 212] Antoine ordes, Xavier Glorot, Jason eston, and Yoshua engio. Joint learning o words and meaning representations or open-text semantic parsing. In Proceedings o AISTATS, pages , 212. [ordes et al., 213] Antoine ordes, Nicolas Usunier, Alberto Garcia-Duran, Jason eston, and Oksana Yakhnenko. Translating embeddings or modeling multirelational data. In Proceedings o NIPS, pages , 213. [owman et al., 215] Samuel R owman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus or learning natural language inerence. arxiv preprint arxiv: , 215. [Collobert et al., 211] Ronan Collobert, Jason eston, Léon ottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) rom scratch. The Journal o Machine Learning Research, 12: , 211. [Glorot and engio, 21] Xavier Glorot and Yoshua engio. Understanding the diiculty o training deep eedorward neural networks. In Proceedings o AISTATS, pages , 21. [Hinton et al., 212] Georey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation o eature detectors. arxiv preprint arxiv:127.58, 212. [Jenatton et al., 212] Rodolphe Jenatton, Nicolas L Roux, Antoine ordes, and Guillaume R Obozinski. A latent actor model or highly multi-relational data. In Proceedings o NIPS, pages , 212. [Jensen, 1996] Finn V Jensen. An introduction to ayesian networks, volume 21. UCL press London, [Koller and Friedman, 29] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 29. [Lin et al., 215] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings or knowledge graph completion. In Proceedings o AAAI, 215. [Liu and Singh, 24] Hugo Liu and Push Singh. Conceptnetła practical commonsense reasoning tool-kit. T technology journal, 22(4): , 24. [McCarthy, 1986] John McCarthy. Applications o circumscription to ormalizing common-sense knowledge. Artiicial Intelligence, 28(1):89 116, [Mikolov et al., 213] Tomas Mikolov, Kai Chen, Greg Corrado, and Jerey Dean. Eicient estimation o word representations in vector. arxiv preprint arxiv: , 213. [Minsky, 1988] Marvin Minsky. Society o mind. Simon and Schuster, [Mueller, 214] Erik T Mueller. Commonsense Reasoning: An Event Calculus ased Approach. Morgan Kaumann, 214. [Nair and Hinton, 21] Vinod Nair and Georey E Hinton. Rectiied linear units improve restricted boltzmann machines. In Proceedings o ICML, pages , 21. [Neapolitan, 212] Richard E Neapolitan. Probabilistic reasoning in expert systems: theory and algorithms. CreateSpace Independent Publishing Platorm, 212. [Nickel et al., 212] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning or linked data. In Proceedings o, pages ACM, 212. [Nickel et al., 215] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review o relational machine learning or knowledge graphs. arxiv preprint arxiv: , 215. [Pearl, 1988] Judea Pearl. Probabilistic reasoning in intelligent systems: Networks o plausible reasoning, [Richardson and Domingos, 26] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):17 136, 26. [Socher et al., 213] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks or knowledge base completion. In Proceedings o NIPS, pages , 213. [Sutskever et al., 29] Ilya Sutskever, Joshua Tenenbaum, and Ruslan R Salakhutdinov. Modelling relational data using bayesian clustered tensor actorization. In Proceedings o NIPS, pages , 29. [ang et al., 214] Zhen ang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings o AAAI, pages Citeseer, 214. [Xue et al., 214] Shaoei Xue, Ossama Abdel-Hamid, Hui Jiang, Lirong Dai, and Qingeng Liu. Fast adaptation o deep neural network based on discriminant codes or speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Trans. on, 22(12): , 214.

Probabilistic Reasoning via Deep Learning: Neural Association Models

Probabilistic Reasoning via Deep Learning: Neural Association Models Quan Liu and Hui Jiang and Zhen-Hua Ling and Si ei and Yu Hu National Engineering Laboratory or Speech and Language Inormation Processing