arxiv: v1 [cs.ai] 24 Mar 2016

Size: px
Start display at page:

Download "arxiv: v1 [cs.ai] 24 Mar 2016"

Transcription

1 arxiv: v1 [cs.ai] 24 Mar 216 Probabilistic Reasoning via Deep Learning: Neural Association Models Quan Liu and Hui Jiang and Zhen-Hua Ling and Si ei and Yu Hu National Engineering Laboratory or Speech and Language Inormation Processing University o Science and Technology o China, Heei, Anhui, China Department o Electrical Engineering and Computer Science, York University, Canada iflytek Research, Heei, China s: quanliu@mail.ustc.edu.cn, hj@cse.yorku.ca, zhling@ustc.edu.cn siwei@ilytek.com, yuhu@ilytek.com Abstract In this paper, we propose a new deep learning approach, called neural association model (NAM), or probabilistic reasoning in artiicial intelligence. e propose to use neural networks to model association between any two events in a domain. Neural networks take one event as input and compute a conditional probability o the other event to model how likely these two events are associated. The actual meaning o the conditional probabilities varies between applications and depends on how the models are trained. In this work, as two case studies, we have investigated two NAM structures, namely deep neural networks (DNNs) and relation-modulated neural nets (s), on several probabilistic reasoning tasks in AI, including recognizing textual entailment, triple classiication in multi-relational knowledge bases and commonsense reasoning. Experimental results on several popular data sets derived rom ordnet, Free- ase and ConceptNet have all demonstrated that both DNNs and s perorm equally well and they can signiicantly outperorm the conventional methods available or these reasoning tasks. Moreover, comparing with DNNs, s are superior in knowledge transer, where a pre-trained model can be quickly extended to an unseen relation ater observing only a ew training samples. 1 Introduction Reasoning is an important topic in artiicial intelligence (AI), which has attracted considerable attention and research eort in the past ew decades [McCarthy, 1986; Minsky, 1988; Mueller, 214]. esides the traditional logic reasoning, probabilistic reasoning has been studied as another typical genre in order to handle knowledge uncertainty in reasoning based on the probability theory [Pearl, 1988; Neapolitan, 212]. The probabilistic reasoning aims to predict the conditional probability Pr(E 2 E 1 ) o one event E 2 given another event E 1. State-o-the-art methods or probability reasoning include ayesian networks [Jensen, 1996], Markov logic networks [Richardson and Domingos, 26] and other graphical models [Koller and Friedman, 29]. Taking ayesian networks as example, the conditional probabilities between two associated events are calculated as posterior probabilities according to the ayes theorem, where all possible events are modeled by a pre-deined graph structure. However, these methods quickly become intractable or most practical tasks where the number o all possible events is large. In recent years, distributed representations that map discrete language units into continuous vector s have gained signiicant popularity along with the development o neural networks [engio et al., 23; Collobert et al., 211; Mikolov et al., 213]. The main beneit or embedding in continuous is its smooth property to capture the semantic relatedness between discrete events, potentially generalizable to unseen events. Similar ideas, such as knowledge graph embedding, have been proposed to represent knowledge bases (K) in low-dimensional continuous [ordes et al., 213; Socher et al., 213; ang et al., 214; Nickel et al., 215]. Using the smoothed K representation, it is possible to reason over the relations among various entities. However, human-like reasoning remains as an extremely challenging problem because it requires to eectively encode world knowledge using powerul models. Most o the existing Ks are quite sparse and can only capture a raction o world knowledge, even or those recently created large-scale Ks, such as YAGO, NELL and Freebase. In order to take advantage o these sparse knowledge bases, the state-o-theart approaches or knowledge graph embedding usually adopt simple linear models, such as RESCAL [Nickel et al., 212] or statistical relational learning (SRL). Motivated by the recent successes o deep learning in pattern recognition, in this paper, we propose to use deep neural networks, called neural association model (NAM), to represent the sparse knowledge bases. Dierent rom the existing linear models, the proposed NAM model uses multi-layer nonlinear activations in deep neural nets to model the association conditional probabilities between any two possible events. In the proposed NAM ramework, all symbolic events are represented in low-dimensional continuous and we have no need to explicitly speciy any dependency structure among events as required in ayesian networks. Deep neural networks are used to model the association between any two events, taking one event as input to compute a conditional probability o another event. The computed conditional

2 tor probability or association may be generalized to model various reasoning problems, such as entailment inerence, relational learning, causation modelling and so on. In this work, we study two model structures or NAM. The irst model is a standard deep neural networks (DNN) and the second model uses a special structure called relation modulated neural nets (). Experiments on several probabilistic reasoning tasks, including recognizing textual entailment, triple classiication in multi-relational Ks and commonsense reasoning, have demonstrated that both DNN and can signiicantly outperorm other conventional methods. Moreover, the model is shown to be eective in knowledge transer learning, where a pre-trained model can be quickly extended to a new relation ater observing only a ew training samples. The main contributions o this paper can be summarized as ollows: e propose neural association models or probabilistic reasoning based on deep neural networks, which (L+1) are general enough to deal with various reasoning problems Score unction over symbolic events. One speciic model structure studied (L) in this work (L) is quiteout: eective z (L) or knowledge transer learning, which In: a can quickly (L) adapt an existing knowledge base to newly encountered scenarios and contexts. (L) To the best o our knowledge, this is the irst work to demonstrate out: z (2) that DNNs with the multi-layer nonlinearity are useul or probabilistic reasoning. 2 A Review on Statistical Relation Learning This paperout: mainly z (1) ocuses on probabilistic reasoning on multi-relational In: a (1) data. e irst review the ormulation o statistical relational learning (SRL) as in [Nickel et al., 215]. Let (1) Existed Relations Head entity vector E = {e 1,..., e Ne } be the set o all entities and R = {r 1,..., r Nr } be the set o all relation types in K, where N e is the total entity number and N r the relation number. SRL Head entity attempts vector to model each possible triple x ijk = (e i, r k, e j ) over the sets o entities and relations as a binary random variable y ijk {, 1} that indicates true or alse. All possible triples in E R E may be ormed as a third-order tensor Y {, 1} Ne Ne Nr. SRL aims to estimate the joint distribution o Y, Pr(Y), rom a training set D o all observed triples. Most SRL models assume that all y ijk are conditionally independent given the model parameters Θ. As a result, relying on a score unction (x ijk ; Θ) or each triple x ijk = (e i, r k, e j ), the likelihood unction o an SRL model may be derived as ollows: N e N e P(Y D, Θ) = N r i=1 j=1 k=1 er (y ijk σ ((x ijk ; Θ))) (1) where σ( ) denotes the sigmoid unction and er (y p) the ernoulli distribution as: { p, i y = 1 er (y p) = (2) 1 p, i y = Under the above ormulation, the key work is to deine the score unction or each triple. In [Nickel et al., 212], (2) (1) (1) they have proposed RESCAL to explain triples via pairwise interactions o latent eatures. In [Sutskever et al., 29; Jenatton et al., 212], a latent actor model is designed to capture the second-order correlation between entities using a quadratic orm. The work in [Lin et al., 215] attempts to learn the latent embeddings by projecting all entities rom the entity to the corresponding relation and then building translations among the projected entities. In [ordes et al., 212], they have proposed a model based on semantic matching energy to capture the correlation between entities and relations via multiple matrix products and Hadamard products. In [Socher et al., 213], neural tensor model (NTM) is constructed using a very expressive score unction with a large number o parameters. TransE presented in [ordes et al., 213] deines a linear unction or modeling triples in a joint o entities and relations. Finally, [ang et al., 214] propose to construct the score unction by projecting entities into a relation-speciic hyperplane. Obviously, almost all existing methods in the literature (L+1) adopt linear models or statistical relation learning. 3 Neural Association Models (L) (NAM) In this paper, we propose to use a nonlinear model, namely deep neural nets, or probabilistic reasoning. Our main goal is to use neural nets to model the association (2) probability (2) or any two events E 1 and E 2 in a domain, i.e., Pr(E 2 E 1 ) o E 2 conditioning on E 1. All possible events in the domain are projected into continuous (1) without speciying (1) any explicit dependency structure among them. In the ollowing, we irst introduce neural association models (NAM) as a general modeling ramework or probabilistic reasoning in section 3.1. Next, we describe two particular NAM Tail entity model vector structuresnew orrelation multi-relational Head data entity invector sections 3.2 and NAM in general Event E 1 Transering Deep Neural Networks Association in DNNs P(E 2 E 1 ) Figure 1: The NAM ramework in general. (L) Event E 2 Figure 1 shows the general ramework o NAM or associating two events, E 1 and E 2. In the general NAM ramework, the events are irst projected into a low-dimension continuous. Deep neural networks with multi-layer nonlinearity are used to model how likely these two events are associated. Neural networks take the embedding o one event E 1 (antecedent) as input and computes a conditional probability Pr(E 2 E 1 ) o the other event E 2 (consequent). I the event E 2 is binary (true or alse), the NAM models may use a sigmoid node to compute Pr(E 2 E 1 ). I E 2 takes multiple

3 Existed Relations (L+1) (L) (2) (1) mutually exclusive values, we use a ew sotmax nodes or Pr(E 2 E 1 ), where it may need to use multiple embeddings or E 2 (one per value). NAMs do not explicitly speciy how dierent events E 2 are actually related, they may be mutually exclusive, contained, intersected. NAMs are only used to separately compute conditional probabilities, Pr(E 2 E 1 ), or each pair o events, E 1 and E 2, in a task. The actual physical meaning o the conditional probabilities Pr(E 2 E 1 ) varies between applications and depends on how the models are trained. Table 1 lists a ew possible applications or NAMs. Application E 1 E 2 language modeling h w causal reasoning cause eect knowledge triple classiication {e i, r k } e j lexical entailment 1 2 textual entailment D 1 D 2 Table 1: Some applications or NAMs. In language modeling, the antecedent event is the representation o history context, h, and the consequent event is next word w that takes one out o K values. In causal reasoning, E 1 and E 2 represent cause and eect respectively. For example, we have E 1 = eating cheesy cakes and E 2 = getting happy, Pr(E 2 E 1 ) means that how likely E 1 may cause the binary event E 2 (true or alse). In the same model, we may add more nodes to model dierent eects rom the same E 1, e.g., E 2 = growing at. Moreover, we (L+1) may add 5 sotmax nodes to model a multi-valued event, e.g., E 2 = happiness (scale rom (L) 1 to 5). Similarly, or knowledge triple classiication o multi-relationtransering data, given one triple (e i, r k, e j ), E 1 (L) (L) consists o the head entity (subject) e i and relation (predicate) r (2) k, and E 2 is a binary event indicating (2) whether (2) the tail entity (object) e j is true or alse. Finally, the applications o recognizing (1) lexical or textual entailment, (1) E 1 and E 2 may be (1) deined as premise and hypothesis. More generally, NAMs can be used to model an ininite number o events E 2, where each point in a continuous represents a possible event. Head entity vector New Relation Head entity vector In this work, or simplicity, we only consider NAMs or a inite number o binary events E 2 but the ormulation can be easily extended to more general cases. Comparing with traditional Deep methods, Neural Networkslike ayesian networks, NAMs employ neural nets as a universal approximator to directly model individual pair-wise event association Event E1 Event E2 probabilities without relying on explicit dependency structure among them. Thereore, NAMs can be end-to-end learned Association in DNNs purely rom training samples without P(E2 E1) strong human prior knowledge, potentially more scalable to real-world tasks. Learning o NAMs Assume we have a set o N d observed examples (event pairs {E 1, E 2 }), D, each o which is denoted as x n. This training set normally includes both positive and negative samples. e denote all positive samples (E 2 = true) as D + and all negative samples (E 2 = alse) as D. Under the same independence assumption in SRL, the log likelihood unction o a NAM model can be expressed as ollows: L(Θ) = ln (x + n ; Θ) + ln(1 (x n ; Θ)) x + n D + x n D (3) where (x n ; Θ) denotes a logistic score unction derived by the NAM or each x n, which numerically computes the conditional probability Pr(E 2 E 1 ). More details on ( ) are given later. Stochastic gradient descent (SGD) methods may be used to maximize the above likelihood unction, leading to maximum likelihood estimation (MLE) or NAMs. In the ollowing, as two case studies, we consider two NAM structures with a inite number o output nodes to model Pr(E 2 E 1 ) or any pair o events, where we have only a inite number o E 2 and each E 2 is binary. The irst model is a typical DNN that associates antecedent event (E 1 ) at input and consequent event (E 2 ) at output. Moreover, we present another model structure, called relation-modulated neural nets, which is more suitable or multi-relational data. 3.2 DNN or NAMs The irst NAM structure is a traditional DNN as shown in Figure 2. Here we use multi-relational data in K or illustration. Given a K triple x n = (e i, r k, e j ) and its corresponding label y n (true or alse), we cast E 1 = (e i, r k ) and E 2 = e j to compute Pr(E 2 E 1 ) as ollows. Relation vector Association at here (1) (L) out: z (L) In: a (L) out: z (2) out: z (1) In: a (1) Head entity vector Figure 2: The DNN structure or NAMs. Firstly, we represent head entity phrase e i and tail entity phrase e j by two embedding vectors v (1) i ( V (1) ) and v (2) j ( V (2) ). Similarly, relation r k is also represented by a low-dimensional vector c k C, which we call it relation code hereater. Secondly, (L+1) as shown Association in Figure at here 2, we combine the embeddings o the head entity e i and the relation r k to out: z eed into an (L + 1)-layer DNN as input. (L) In: The a DNN consists (L) o L rectiied linear (ReLU) (L) hidden layers [Nair (L) and Hinton, 21]. The input to DNN is written as z () = [v (1), c k ]. During the eedorward process, (2) we have out: z (2) out: z (1) In: a (1) a (l) = (l) z (l 1) + b l (l = 1,, L) (4) (1) ( (1) Relation z (l) = vector h a (l)) ( = max, a (l)) Head (l entity = 1, vector, L) (5) where (l) and b l represent the weight matrix and bias or layer l respectively. i

4 (L) (2) (1) (head) ector Event E2 Finally, we propose Association toat calculate here a sigmoid score or each triple x n = (e i, r k, e j ) as the association probability using out: z (L) the last hidden layer s output and the tail entity In: a vector v (2) (L) j : ( (L) ) (x n ; Θ) = σ z (L) v (2) out: z (2) (6) where σ( ) is the sigmoid unction, i.e., σ(x) out: z = 1/(1 + e x (1) ). In: a All network parameters o this NAM structure, (1) represented as Θ = {, V (1), V (2), C}, may (1) be jointly learned by maximizing the likelihood unction in eq. Relation vector (3). V (head) 3.3 Relation-modulated Neural Networks () Particularly or multi-relation data, ollowing the idea in [Xue et al., 214], we propose to use the so-called relationmodulated neural nets (s), as shown in Figure 3. Relation vector (L+1) (L) (2) (1) j Association at here (1) (L) out: z (L) In: a (L) out: z (2) out: z (1) In: a (1) Head entity vector Head entity vector Figure 3: The relation-modulated neural networks (). The uses the same operations as DNNs to project all entities and relations into low-dimensional continuous s. As shown in Figure 3, we connect the knowledgespeciic relation code c (k) to all hidden layers in the network. As shown later, this structure is superior in knowledge transer learning tasks. Thereore, or each layer o s, instead o using eq.(4), its linear activation signal is computed rom the previous layer z (l 1) and the relation code c (k) as ollows: a (l) = (l) z (l 1) + (l) c (k), (l = 1 L) (7) where (l) and l represent the normal weight matrix and the relation-speciic weight matrix or layer l. At the topmost layer, we propose to calculate the inal score or each triple x n = (e i, r k, e j ) using the relation code as: ( (x n ; Θ) = σ z (L) v (2) j + (L+1) c (k)). (8) In the same way, all parameters, including Θ = {,, V (1), V (2), C}, can be jointly learned based on the above maximum likelihood estimation. The models are particularly suitable or knowledge transer learning, where a pre-trained model can be quickly extended to any new relation ater observing a ew samples rom that relation. In this case, we may estimate a new relation code based on the available new samples while keeping the whole network unchanged. Due to its small size, the new relation code can be reliably estimated rom only a small number o new samples. Furthermore, model perormance in all original relations will not be aected since the model and all original relation codes are not changed in the transer learning. 4 Experiments In this section, we evaluate the proposed NAM models or various reasoning tasks. e irst describe the experimental setup and then we will report several reasoning tasks examined in this work, including textual entailment recognition, triple classiication in multi-relational Ks, commonsense reasoning and knowledge transer learning. 4.1 Experimental setup Here we irst introduce some common experimental settings used or all experiments: (1) For entity or sentence representations, we represent them by composing rom its word vectors as in [Socher et al., 213]. All word vectors are initialized rom a pre-trained skip-gram [Mikolov et al., 213] word embedding model, trained on a large English ikipedia corpus 1. The dimension or all word embeddings are set to 1 or all experiments. (2) The dimension o all relation codes are all set to 5. All relation codes are randomly initialized. (3) For network structures, we use ReLU as the nonlinear activation unction and all network parameters are initialized according to [Glorot and engio, 21]. Meanwhile, since the number o training examples or most probabilistic reasoning tasks is relative small, we adopt the dropout approach [Hinton et al., 212] during the training process to avoid the over-itting problem. (4) During the learning process o NAMs, we need to use negative samples, which are automatically generated by randomly perturbing positive K triples as D = {(e i, r k, e l ) e l e j (e i, r k, e j ) D + }. For each task, we use the provided development set to tune or the best training hyperparameters. For example, we have tested the number o hidden layers among {1, 2, 3}, the initial learning rate among {.1,.5,.1,.25,.5}, dropout rate among {,.1,.2,.3,.4}. Finally, we select the best setting based on the perormances on the development set: the inal model structure uses 2 hidden layers, and the learning rate and the dropout rate are set to.1 and.2 or all the experiments, respectively. During model training, the learning rate is halved once the perormances in the development set decreases. oth DNNs and s are trained using the stochastic gradient descend (SGD) algorithm. 2 e notice that the NAM models converge ast ater 3 epochs o learning. 4.2 Recognizing Textual Entailment Understanding entailment and contradiction is undamental to understanding natural language. Here we conduct experiments on a popular recognizing textual entailment (RTE) task, which aims to recognize the entailment relationship between a pair o English sentences. In this experiment, we 1 A small number o words do not have pre-trained vectors in the skip-gram model. e just initialize them randomly. 2 All Codes or NAMs will be made available in the uture.

5 use the SNLI dataset in [owman et al., 215] to conduct 2-class RTE experiments (entailment or contradiction). The SNLI dataset contains hundreds o thousands training examples, which is used to train a NAM model. Since this data set is not or multi-relational data, we only investigate the DNN structure or this task. The inal NAM result, along with the baseline perormance provided in [owman et al., 215], is listed in Table 2. Model Accuracy (%) Edit Distance [owman et al., 215] 71.9 Classiier [owman et al., 215] 72.2 Lexical Resources [owman et al., 215] 75. DNN 84.7 Table 2: Experimental results on the RTE task. From the results, we can see the proposed DNN based NAM model achieves considerable improvements over various traditional methods. This indicates that we can better model entailment relationship in natural language by representing sentences in continuous and conducting probabilistic reasoning with deep neural networks. 4.3 Triple classiication in multi-relational Ks In this section, we evaluate the proposed NAM models on two popular knowledge triple classiication datasets, namely N11 and F13 in [Socher et al., 213] (derived rom ord- Net and Freease), to predict whether some new triple relations hold based on other training acts in the database. The N11 dataset contains 38,696 unique entities involving 11 dierent relations in total while the F13 dataset covers 13 relations and 75,43 entities. Table 3 summarizes the statistics o these two data sets. Dataset # R # Ent # Train # Dev # Test N , ,581 2,69 1,544 F ,43 316,232 5,98 23,733 Table 3: The statistics or Ks triple classiication datasets. #R is the number o relations. #Ent is the size o entity set. The goal o knowledge triple classiication is to predict whether a given triple x n = (e i, r k, e j ) is correct or not. e irst use the training data to learn NAM models. Aterwards, we use the development set to tune a global threshold T to make a binary decision: the triple is classiied as true i (x n ; Θ) T ; otherwise it is alse. The inal accuracy is calculated based on how many triplets in the test set are classiied correctly. Experimental results on both N11 and F13 datasets are given in Table 4, where we compare the two NAM models with all other methods reported on these two data sets. The results clearly show that both NAM methods (DNN and ) achieve comparable perormance on these triple classiication tasks, and both yield consistent improvement over all existing methods. Particularly, the model gives 3.7% and 1.9% absolute improvements over the popular neural tensor networks (NTN) [Socher et al., 213] on N11 and F13 respectively. Model N11 F13 Avg. SME [ordes et al., 212] TransE [ordes et al., 213] TransH [ang et al., 214] TransR [Lin et al., 215] NTN [Socher et al., 213] DNN Table 4: Triple classiication accuracy o various methods in N11 and F13. Note that both DNN and models are much smaller than NTN in number o parameters and they scale well as the number o relation types increases. For example, either DNN or or N11 has about 7.8 millions o parameters while NTN has about 15 millions. Although RESCAL or TransE has about 4 millions o parameters or N11, their size goes up quickly or other tasks o thousands or more relation types. In addition, the training time o DNN and is much shorter than that o NTN or TransE since our models converge much aster. For example, we have obtained at least 5 times speedup over NTN in N Commonsense Reasoning Similar to the triple classiication task [Socher et al., 213], in this work, we use the ConceptNet K [Liu and Singh, 24] to construct a new commonsense data set, named as CN14 hereater 3. hen building CN14, we irst select all acts in ConceptNet related to 14 typical commonsense relations, e.g., UsedFor, CapableO. (see Figure 4 or all 14 relations.) Then, we randomly divide the extracted acts into three sets, Train, Dev and Test. Finally, in order to create a test set or classiication, we randomly switch entities (in the whole vocabulary) rom correct triples and result in a total o 2 #Test triples (hal is positive samples and hal is negative examples). The statistics o CN14 are given in Table 5. Dataset # R # Ent. # Train # Dev # Test CN ,135 2,198 5, 1, Table 5: The statistics or the CN14 dataset. The CN14 dataset is designed or answering commonsense questions like Is a camel capable o journeying across desert? The proposed NAM models answer this question by calculating the association probability Pr(E 2 E 1 ) where E 1 = {camel, capable o} and E 2 = journey across desert. In this paper, we compare two NAM methods with the popular NTN method in [Socher et al., 213] on this data set and the overall results are given in Table 6. e can see that both NAM methods signiicantly outperorm NTN in this task, and the DNN and models obtain similar perormance. Furthermore, we show the classiication accuracy o all 14 relations in CN14 or RNMM and NTN in Figure 4, which 3 e will publicly release this data set soon under the ConceptNet licence.

6 Model Positive Negative total NTN DNN Table 6: Accuracy (in %) comparison on CN14. SymbolO DesireO Createdy HasLastSubevent Desires CausesDesire ReceivesAction MotivatedyGoal Causes HasProperty HasPrerequisite HasSubevent CapableO UsedFor NTN Figure 4: Accuracy o dierent relations in CN14. show that the accuracy o varies among dierent relations rom 8.1% (Desires) to 93.5% (Createdy). e notice some commonsense relations (such as Desires, CapableO, HasSubevent) are harder than the others (like Createdy, CausesDesire, MotivatedyGoal). In general, RNMM signiicantly overtakes NTN in almost all relations. 4.5 Knowledge Transer Learning DNN 5% 1% 15% 2% 25% 5% 75% 1% Figure 5: Accuracy (in %) on the test set o a new relation CausesDesire is shown as a unction o used training samples rom CausesDesire when updating the relation code only. (Accuracy on the original relations remains as 85.6%.) Knowledge transer between various domains is a characteristic eature and crucial cornerstone o human learning. In this section, we evaluate the proposed NAM models or a knowledge transer learning scenario, where we adapt a pretrained model to an unseen relation with only a ew training samples rom the new relation. Here we randomly select a relation, e.g., CausesDesire in CN14 or this experiment. This relation contains only 48 training samples and 48 test samples. During the experiments, we use all other 13 relations in CN14 to train baseline NAM models (both DNN and ). During the transer learning, we reeze all NAM parameters, including all weights and entity representations, and only learn a new relation code or CausesDesire rom the given samples. At last, the learned relation code (along with the original NAM models) is used to classiy the new samples o CausesDesire in the test set. Obviously, this transer learning does not aect the model perormance in the original 13 relations because the models are not changed. Figure 5 show the results o knowledge transer learning or the relation CausesDesire as we increase the training samples gradually. The result shows that RNMM perorms much better than DNN in this experiment, where we can signiicantly improve RNMM or the new relation with only 5-2% o the total training samples or CausesDesire. This demonstrates that the structure to connect the relation code to all hidden layers leads to more eective learning o new relation codes rom a relatively small number o training samples. Next, we also experiment a more aggressive learning strategy or this transer learning setting, where we simultaneously update all the network parameters during the learning o the new relation code. The results are shown in Figure 6. This strategy can deinitely improve perormance more on the new relation, especially when we add more training samples. However, as expected, the perormance on the original 13 relations will deteriorate. The DNN may improve the perormance on the new relation as we use all training samples (up to 94.6%). However, the perormance on the remain 13 original relations drops dramatically rom 85.6% to 75.5%. Once again, shows the advantage over DNN in this transer learning setting, where the accuracy on the new relation increases rom 77.9% to 9.8% but the accuracy on the original 13 relations only drop slightly rom 85.9% to 82.% DNN 5% 1% 15% 2% 25% 5% 75% 1% DNN 5% 1% 15% 2% 25% 5% 75% 1% Figure 6: Transer learning results by updating all network parameters. The let igure shows results on the new relation while the right igure shows results on the original relations. 5 Conclusions In this paper, we have proposed neural association models (NAM) or probabilistic reasoning. e use neural networks to model association probabilities between any two events in a domain. In this work, we have investigated two model structures, namely DNN and, or NAMs. Experimental results on several reasoning tasks have shown that both DNNs and s can signiicantly outperorm the existing methods. This paper also reports some preliminary results to use NAMs or knowledge transer learning. e have ound that the proposed model can be quickly adapted to a new relation without sacriicing the perormance in the original relations.

7 Reerences [engio et al., 23] Yoshua engio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal o Machine Learning Research, 3: , 23. [ordes et al., 212] Antoine ordes, Xavier Glorot, Jason eston, and Yoshua engio. Joint learning o words and meaning representations or open-text semantic parsing. In Proceedings o AISTATS, pages , 212. [ordes et al., 213] Antoine ordes, Nicolas Usunier, Alberto Garcia-Duran, Jason eston, and Oksana Yakhnenko. Translating embeddings or modeling multirelational data. In Proceedings o NIPS, pages , 213. [owman et al., 215] Samuel R owman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus or learning natural language inerence. arxiv preprint arxiv: , 215. [Collobert et al., 211] Ronan Collobert, Jason eston, Léon ottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) rom scratch. The Journal o Machine Learning Research, 12: , 211. [Glorot and engio, 21] Xavier Glorot and Yoshua engio. Understanding the diiculty o training deep eedorward neural networks. In Proceedings o AISTATS, pages , 21. [Hinton et al., 212] Georey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation o eature detectors. arxiv preprint arxiv:127.58, 212. [Jenatton et al., 212] Rodolphe Jenatton, Nicolas L Roux, Antoine ordes, and Guillaume R Obozinski. A latent actor model or highly multi-relational data. In Proceedings o NIPS, pages , 212. [Jensen, 1996] Finn V Jensen. An introduction to ayesian networks, volume 21. UCL press London, [Koller and Friedman, 29] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 29. [Lin et al., 215] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings or knowledge graph completion. In Proceedings o AAAI, 215. [Liu and Singh, 24] Hugo Liu and Push Singh. Conceptnetła practical commonsense reasoning tool-kit. T technology journal, 22(4): , 24. [McCarthy, 1986] John McCarthy. Applications o circumscription to ormalizing common-sense knowledge. Artiicial Intelligence, 28(1):89 116, [Mikolov et al., 213] Tomas Mikolov, Kai Chen, Greg Corrado, and Jerey Dean. Eicient estimation o word representations in vector. arxiv preprint arxiv: , 213. [Minsky, 1988] Marvin Minsky. Society o mind. Simon and Schuster, [Mueller, 214] Erik T Mueller. Commonsense Reasoning: An Event Calculus ased Approach. Morgan Kaumann, 214. [Nair and Hinton, 21] Vinod Nair and Georey E Hinton. Rectiied linear units improve restricted boltzmann machines. In Proceedings o ICML, pages , 21. [Neapolitan, 212] Richard E Neapolitan. Probabilistic reasoning in expert systems: theory and algorithms. CreateSpace Independent Publishing Platorm, 212. [Nickel et al., 212] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning or linked data. In Proceedings o, pages ACM, 212. [Nickel et al., 215] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review o relational machine learning or knowledge graphs. arxiv preprint arxiv: , 215. [Pearl, 1988] Judea Pearl. Probabilistic reasoning in intelligent systems: Networks o plausible reasoning, [Richardson and Domingos, 26] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):17 136, 26. [Socher et al., 213] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks or knowledge base completion. In Proceedings o NIPS, pages , 213. [Sutskever et al., 29] Ilya Sutskever, Joshua Tenenbaum, and Ruslan R Salakhutdinov. Modelling relational data using bayesian clustered tensor actorization. In Proceedings o NIPS, pages , 29. [ang et al., 214] Zhen ang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings o AAAI, pages Citeseer, 214. [Xue et al., 214] Shaoei Xue, Ossama Abdel-Hamid, Hui Jiang, Lirong Dai, and Qingeng Liu. Fast adaptation o deep neural network based on discriminant codes or speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Trans. on, 22(12): , 214.

Probabilistic Reasoning via Deep Learning: Neural Association Models

Probabilistic Reasoning via Deep Learning: Neural Association Models Probabilistic Reasoning via Deep Learning: Neural Association Models Quan Liu and Hui Jiang and Zhen-Hua Ling and Si ei and Yu Hu National Engineering Laboratory or Speech and Language Inormation Processing

More information

ParaGraphE: A Library for Parallel Knowledge Graph Embedding

ParaGraphE: A Library for Parallel Knowledge Graph Embedding ParaGraphE: A Library for Parallel Knowledge Graph Embedding Xiao-Fan Niu, Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology, Nanjing University,

More information

Learning Multi-Relational Semantics Using Neural-Embedding Models

Learning Multi-Relational Semantics Using Neural-Embedding Models Learning Multi-Relational Semantics Using Neural-Embedding Models Bishan Yang Cornell University Ithaca, NY, 14850, USA bishan@cs.cornell.edu Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research

More information

A Translation-Based Knowledge Graph Embedding Preserving Logical Property of Relations

A Translation-Based Knowledge Graph Embedding Preserving Logical Property of Relations A Translation-Based Knowledge Graph Embedding Preserving Logical Property of Relations Hee-Geun Yoon, Hyun-Je Song, Seong-Bae Park, Se-Young Park School of Computer Science and Engineering Kyungpook National

More information

2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,

2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising

More information

TransG : A Generative Model for Knowledge Graph Embedding

TransG : A Generative Model for Knowledge Graph Embedding TransG : A Generative Model for Knowledge Graph Embedding Han Xiao, Minlie Huang, Xiaoyan Zhu State Key Lab. of Intelligent Technology and Systems National Lab. for Information Science and Technology Dept.

More information

Learning Entity and Relation Embeddings for Knowledge Graph Completion

Learning Entity and Relation Embeddings for Knowledge Graph Completion Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Entity and Relation Embeddings for Knowledge Graph Completion Yankai Lin 1, Zhiyuan Liu 1, Maosong Sun 1,2, Yang Liu

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

arxiv: v1 [cs.ai] 12 Nov 2018

arxiv: v1 [cs.ai] 12 Nov 2018 Differentiating Concepts and Instances for Knowledge Graph Embedding Xin Lv, Lei Hou, Juanzi Li, Zhiyuan Liu Department of Computer Science and Technology, Tsinghua University, China 100084 {lv-x18@mails.,houlei@,lijuanzi@,liuzy@}tsinghua.edu.cn

More information

Locally Adaptive Translation for Knowledge Graph Embedding

Locally Adaptive Translation for Knowledge Graph Embedding Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Locally Adaptive Translation for Knowledge Graph Embedding Yantao Jia 1, Yuanzhuo Wang 1, Hailun Lin 2, Xiaolong Jin 1,

More information

REVISITING KNOWLEDGE BASE EMBEDDING AS TEN-

REVISITING KNOWLEDGE BASE EMBEDDING AS TEN- REVISITING KNOWLEDGE BASE EMBEDDING AS TEN- SOR DECOMPOSITION Anonymous authors Paper under double-blind review ABSTRACT We study the problem of knowledge base KB embedding, which is usually addressed

More information

Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories

Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories Miao Fan, Qiang Zhou and Thomas Fang Zheng CSLT, Division of Technical Innovation and Development, Tsinghua

More information

Knowledge Graph Completion with Adaptive Sparse Transfer Matrix

Knowledge Graph Completion with Adaptive Sparse Transfer Matrix Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Knowledge Graph Completion with Adaptive Sparse Transfer Matrix Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao National Laboratory

More information

arxiv: v2 [cs.cl] 28 Sep 2015

arxiv: v2 [cs.cl] 28 Sep 2015 TransA: An Adaptive Approach for Knowledge Graph Embedding Han Xiao 1, Minlie Huang 1, Hao Yu 1, Xiaoyan Zhu 1 1 Department of Computer Science and Technology, State Key Lab on Intelligent Technology and

More information

Learning Multi-Relational Semantics Using Neural-Embedding Models

Learning Multi-Relational Semantics Using Neural-Embedding Models Learning Multi-Relational Semantics Using Neural-Embedding Models Bishan Yang Cornell University Ithaca, NY, 14850, USA bishan@cs.cornell.edu Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research

More information

Spatial Transformer. Ref: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transformer Networks, NIPS, 2015

Spatial Transformer. Ref: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transformer Networks, NIPS, 2015 Spatial Transormer Re: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transormer Networks, NIPS, 2015 Spatial Transormer Layer CNN is not invariant to scaling and rotation

More information

STransE: a novel embedding model of entities and relationships in knowledge bases

STransE: a novel embedding model of entities and relationships in knowledge bases STransE: a novel embedding model of entities and relationships in knowledge bases Dat Quoc Nguyen 1, Kairit Sirts 1, Lizhen Qu 2 and Mark Johnson 1 1 Department of Computing, Macquarie University, Sydney,

More information

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu

More information

Modeling Relation Paths for Representation Learning of Knowledge Bases

Modeling Relation Paths for Representation Learning of Knowledge Bases Modeling Relation Paths for Representation Learning of Knowledge Bases Yankai Lin 1, Zhiyuan Liu 1, Huanbo Luan 1, Maosong Sun 1, Siwei Rao 2, Song Liu 2 1 Department of Computer Science and Technology,

More information

Modeling Relation Paths for Representation Learning of Knowledge Bases

Modeling Relation Paths for Representation Learning of Knowledge Bases Modeling Relation Paths for Representation Learning of Knowledge Bases Yankai Lin 1, Zhiyuan Liu 1, Huanbo Luan 1, Maosong Sun 1, Siwei Rao 2, Song Liu 2 1 Department of Computer Science and Technology,

More information

Knowledge Graph Embedding with Diversity of Structures

Knowledge Graph Embedding with Diversity of Structures Knowledge Graph Embedding with Diversity of Structures Wen Zhang supervised by Huajun Chen College of Computer Science and Technology Zhejiang University, Hangzhou, China wenzhang2015@zju.edu.cn ABSTRACT

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Deep Learning for NLP

Deep Learning for NLP Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning

More information

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1) 11/3/15 Machine Learning and NLP Deep Learning for NLP Usually machine learning works well because of human-designed representations and input features CS224N WordNet SRL Parser Machine learning becomes

More information

Web-Mining Agents. Multi-Relational Latent Semantic Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Web-Mining Agents. Multi-Relational Latent Semantic Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Web-Mining Agents Multi-Relational Latent Semantic Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Tanya Braun (Übungen) Acknowledgements Slides by: Scott Wen-tau

More information

Towards Understanding the Geometry of Knowledge Graph Embeddings

Towards Understanding the Geometry of Knowledge Graph Embeddings Towards Understanding the Geometry of Knowledge Graph Embeddings Chandrahas Indian Institute of Science chandrahas@iisc.ac.in Aditya Sharma Indian Institute of Science adityasharma@iisc.ac.in Partha Talukdar

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Bayesian Paragraph Vectors

Bayesian Paragraph Vectors Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com

More information

Gaussian Process Regression Models for Predicting Stock Trends

Gaussian Process Regression Models for Predicting Stock Trends Gaussian Process Regression Models or Predicting Stock Trends M. Todd Farrell Andrew Correa December 5, 7 Introduction Historical stock price data is a massive amount o time-series data with little-to-no

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Jointly Embedding Knowledge Graphs and Logical Rules

Jointly Embedding Knowledge Graphs and Logical Rules Jointly Embedding Knowledge Graphs and Logical Rules Shu Guo, Quan Wang, Lihong Wang, Bin Wang, Li Guo Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China University

More information

arxiv: v2 [cs.cl] 1 Jan 2019

arxiv: v2 [cs.cl] 1 Jan 2019 Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Overview Today: From one-layer to multi layer neural networks! Backprop (last bit of heavy math) Different descriptions and viewpoints of backprop

Overview Today: From one-layer to multi layer neural networks! Backprop (last bit of heavy math) Different descriptions and viewpoints of backprop Overview Today: From one-layer to multi layer neural networks! Backprop (last bit of heavy math) Different descriptions and viewpoints of backprop Project Tips Announcement: Hint for PSet1: Understand

More information

A Three-Way Model for Collective Learning on Multi-Relational Data

A Three-Way Model for Collective Learning on Multi-Relational Data A Three-Way Model for Collective Learning on Multi-Relational Data 28th International Conference on Machine Learning Maximilian Nickel 1 Volker Tresp 2 Hans-Peter Kriegel 1 1 Ludwig-Maximilians Universität,

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the

More information

Fast Learning with Noise in Deep Neural Nets

Fast Learning with Noise in Deep Neural Nets Fast Learning with Noise in Deep Neural Nets Zhiyun Lu U. of Southern California Los Angeles, CA 90089 zhiyunlu@usc.edu Zi Wang Massachusetts Institute of Technology Cambridge, MA 02139 ziwang.thu@gmail.com

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Deep Learning for NLP Part 2

Deep Learning for NLP Part 2 Deep Learning for NLP Part 2 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) 2 Part 1.3: The Basics Word Representations The

More information

Neighborhood Mixture Model for Knowledge Base Completion

Neighborhood Mixture Model for Knowledge Base Completion Neighborhood Mixture Model for Knowledge Base Completion Dat Quoc Nguyen 1, Kairit Sirts 1, Lizhen Qu 2 and Mark Johnson 1 1 Department of Computing, Macquarie University, Sydney, Australia dat.nguyen@students.mq.edu.au,

More information

The achievable limits of operational modal analysis. * Siu-Kui Au 1)

The achievable limits of operational modal analysis. * Siu-Kui Au 1) The achievable limits o operational modal analysis * Siu-Kui Au 1) 1) Center or Engineering Dynamics and Institute or Risk and Uncertainty, University o Liverpool, Liverpool L69 3GH, United Kingdom 1)

More information

a) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM

a) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM c 1. (Natural Language Processing; NLP) (Deep Learning) RGB IBM 135 8511 5 6 52 yutat@jp.ibm.com a) b) 2. 1 0 2 1 Bag of words White House 2 [1] 2015 4 Copyright c by ORSJ. Unauthorized reproduction of

More information

Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations

Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng CAS Key Lab of Network Data Science and Technology Institute

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6 Administration Registration Hw3 is out Due on Thursday 10/6 Questions Lecture Captioning (Extra-Credit) Look at Piazza for details Scribing lectures With pay; come talk to me/send email. 1 Projects Projects

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee New Activation Function Rectified Linear Unit (ReLU) σ z a a = z Reason: 1. Fast to compute 2. Biological reason a = 0 [Xavier Glorot, AISTATS 11] [Andrew L.

More information

TUTORIAL PART 1 Unsupervised Learning

TUTORIAL PART 1 Unsupervised Learning TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew

More information

word2vec Parameter Learning Explained

word2vec Parameter Learning Explained word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector

More information

Deep Learning & Neural Networks Lecture 4

Deep Learning & Neural Networks Lecture 4 Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly

More information

text classification 3: neural networks

text classification 3: neural networks text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Deep Learning For Mathematical Functions

Deep Learning For Mathematical Functions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

arxiv: v2 [cs.lg] 6 Jul 2017

arxiv: v2 [cs.lg] 6 Jul 2017 made of made of Analogical Inference for Multi-relational Embeddings Hanxiao Liu 1 Yuexin Wu 1 Yiming Yang 1 arxiv:1705.02426v2 [cs.lg] 6 Jul 2017 Abstract Large-scale multi-relational embedding refers

More information

Robust Residual Selection for Fault Detection

Robust Residual Selection for Fault Detection Robust Residual Selection or Fault Detection Hamed Khorasgani*, Daniel E Jung**, Gautam Biswas*, Erik Frisk**, and Mattias Krysander** Abstract A number o residual generation methods have been developed

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

EXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION

EXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION EXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION Tianxing He Jasha Droppo Key Lab. o Shanghai Education Commission or Intelligent Interaction and Cognitive Engineering SpeechLab,

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Maxout Networks. Hien Quoc Dang

Maxout Networks. Hien Quoc Dang Maxout Networks Hien Quoc Dang Outline Introduction Maxout Networks Description A Universal Approximator & Proof Experiments with Maxout Why does Maxout work? Conclusion 10/12/13 Hien Quoc Dang Machine

More information

A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network

A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network Dai Quoc Nguyen 1, Tu Dinh Nguyen 1, Dat Quoc Nguyen 2, Dinh Phung 1 1 Deakin University, Geelong, Australia,

More information

A Survey of Techniques for Sentiment Analysis in Movie Reviews and Deep Stochastic Recurrent Nets

A Survey of Techniques for Sentiment Analysis in Movie Reviews and Deep Stochastic Recurrent Nets A Survey of Techniques for Sentiment Analysis in Movie Reviews and Deep Stochastic Recurrent Nets Chase Lochmiller Department of Computer Science Stanford University cjloch11@stanford.edu Abstract In this

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Physics 2B Chapter 17 Notes - First Law of Thermo Spring 2018

Physics 2B Chapter 17 Notes - First Law of Thermo Spring 2018 Internal Energy o a Gas Work Done by a Gas Special Processes The First Law o Thermodynamics p Diagrams The First Law o Thermodynamics is all about the energy o a gas: how much energy does the gas possess,

More information

Fisher Consistency of Multicategory Support Vector Machines

Fisher Consistency of Multicategory Support Vector Machines Fisher Consistency o Multicategory Support Vector Machines Yueng Liu Department o Statistics and Operations Research Carolina Center or Genome Sciences University o North Carolina Chapel Hill, NC 7599-360

More information

Combining Two And Three-Way Embeddings Models for Link Prediction in Knowledge Bases

Combining Two And Three-Way Embeddings Models for Link Prediction in Knowledge Bases Journal of Artificial Intelligence Research 1 (2015) 1-15 Submitted 6/91; published 9/91 Combining Two And Three-Way Embeddings Models for Link Prediction in Knowledge Bases Alberto García-Durán alberto.garcia-duran@utc.fr

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

arxiv: v3 [cs.cl] 30 Jan 2016

arxiv: v3 [cs.cl] 30 Jan 2016 word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu arxiv:1411.2738v3 [cs.cl] 30 Jan 2016 Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention

More information

CLASSIFICATION OF MULTIPLE ANNOTATOR DATA USING VARIATIONAL GAUSSIAN PROCESS INFERENCE

CLASSIFICATION OF MULTIPLE ANNOTATOR DATA USING VARIATIONAL GAUSSIAN PROCESS INFERENCE 26 24th European Signal Processing Conerence EUSIPCO CLASSIFICATION OF MULTIPLE ANNOTATOR DATA USING VARIATIONAL GAUSSIAN PROCESS INFERENCE Emre Besler, Pablo Ruiz 2, Raael Molina 2, Aggelos K. Katsaggelos

More information

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava nitish@cs.toronto.edu Ruslan Salahutdinov rsalahu@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu

More information

arxiv: v1 [cs.lg] 3 Jan 2017

arxiv: v1 [cs.lg] 3 Jan 2017 Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, New-Delhi, India January 4, 2017 arxiv:1701.00597v1

More information

Knowledge Graph Embedding via Dynamic Mapping Matrix

Knowledge Graph Embedding via Dynamic Mapping Matrix Knowledge Graph Embedding via Dynamic Mapping Matrix Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu and Jun Zhao National Laboratory of Pattern Recognition (NLPR) Institute of Automation Chinese Academy of

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

CME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016

CME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016 GloVe on Spark Alex Adamson SUNet ID: aadamson June 6, 2016 Introduction Pennington et al. proposes a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes

More information

Deep Convolutional Neural Networks for Pairwise Causality

Deep Convolutional Neural Networks for Pairwise Causality Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,

More information

Continuous Space Language Model(NNLM) Liu Rong Intern students of CSLT

Continuous Space Language Model(NNLM) Liu Rong Intern students of CSLT Continuous Space Language Model(NNLM) Liu Rong Intern students of CSLT 2013-12-30 Outline N-gram Introduction data sparsity and smooth NNLM Introduction Multi NNLMs Toolkit Word2vec(Deep learing in NLP)

More information

An overview of word2vec

An overview of word2vec An overview of word2vec Benjamin Wilson Berlin ML Meetup, July 8 2014 Benjamin Wilson word2vec Berlin ML Meetup 1 / 25 Outline 1 Introduction 2 Background & Significance 3 Architecture 4 CBOW word representations

More information

CS 361 Meeting 28 11/14/18

CS 361 Meeting 28 11/14/18 CS 361 Meeting 28 11/14/18 Announcements 1. Homework 9 due Friday Computation Histories 1. Some very interesting proos o undecidability rely on the technique o constructing a language that describes the

More information

Analogical Inference for Multi-Relational Embeddings

Analogical Inference for Multi-Relational Embeddings Analogical Inference for Multi-Relational Embeddings Hanxiao Liu, Yuexin Wu, Yiming Yang Carnegie Mellon University August 8, 2017 nalogical Inference for Multi-Relational Embeddings 1 / 19 Task Description

More information

arxiv: v1 [cs.lg] 2 Nov 2017

arxiv: v1 [cs.lg] 2 Nov 2017 A Universal Marginalizer for Amortized Inference in Generative Models Laura Douglas, Iliyan Zarov, Konstantinos Gourgoulias, Chris Lucas, Chris Hart, Adam Baker, Maneesh Sahani, Yura Perov, Saurabh Johri

More information

arxiv: v2 [cs.sd] 7 Feb 2018

arxiv: v2 [cs.sd] 7 Feb 2018 AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are

More information

RaRE: Social Rank Regulated Large-scale Network Embedding

RaRE: Social Rank Regulated Large-scale Network Embedding RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat

More information