KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building


XLDB 2018
KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building
Yi Zhang, Xiaofeng Meng
WAMDM@RUC
5/3/2018

Knowledge Graph
What is a knowledge graph?
[Figure: knowledge graphs grouped by language (Chinese, Non-Chinese) and by domain (Open Domain, Specific Domain). Labels in the figure: UMLS, Microbiology, TCM KG, Business KG, Ethic Chinese KG, Fact.]

Microbiology Knowledge Graph
More symbol words and ID-like entities
Entities with long names
The same head-relation pair links to various tails
[Figure: example subgraph around the EnzymeNode entries 1.5.1.17, 1.3.99.8, and 1.1.1.139. Edge labels: name, othername, substrate. Literals: "ADH", "ALPDH", "alanopine dehydrogenase", "NAD+", "Oxidoreductases", "Acting on the CH-NH group of donors", "With NAD+ or NADP+ as acceptor", "Deleted entry", "created 1983, modified 1986", "created 1972, deleted 1978", "10225379", "10225380", "14053246", "bpy:bphyt_2183", "amim:mim_c3 4440".]
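The third property, one head-relation pair linking to several tails, can be illustrated with a small sketch. The triples below reuse labels from the slide's example subgraph, but which literal hangs off which edge is an assumption made for illustration, and the helper names `group_tails` and `multi_tail_pairs` are hypothetical.

```python
from collections import defaultdict

# Illustrative triples (head, relation, tail). The labels come from the slide,
# but the exact edge assignments are assumed, not confirmed by the source.
triples = [
    ("1.5.1.17", "name", "alanopine dehydrogenase"),
    ("1.5.1.17", "othername", "ADH"),
    ("1.5.1.17", "othername", "ALPDH"),
    ("1.5.1.17", "substrate", "NAD+"),
    ("1.1.1.139", "name", "Deleted entry"),
]


def group_tails(triples):
    """Map each (head, relation) pair to the list of tails it links to."""
    tails = defaultdict(list)
    for head, relation, tail in triples:
        tails[(head, relation)].append(tail)
    return dict(tails)


def multi_tail_pairs(triples):
    """Keep only the (head, relation) pairs that link to more than one tail."""
    return {pair: ts for pair, ts in group_tails(triples).items() if len(ts) > 1}


grouped = group_tails(triples)
fan_out = multi_tail_pairs(triples)
```

Pairs that survive `multi_tail_pairs` are the one-to-many cases that make tail prediction harder than head prediction in a graph like this.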

Knowledge Graph Building
[Figure labels: Specific Domain, Open Domain, Auto-DB to Knowledge, Auto-Text to Knowledge]

Overview of KGBuilder

Key Technologies of KGBuilder
Named Entity Recognition: Naïve KBE, Pro-based KBE, Loc-based KBE
Relation Extraction: Distant Supervising, Intra-Sentence, Cross-Sentence
Knowledge Graph Completion: TransMT, TransMT v, TransMT s

Named Entity Recognition
Making full use of domain knowledge.
F1 score (%) for NER:
Method          Overall   Bacteria   Habitat
Baseline        42.00     52.24      35.56
Naïve KBE       45.08     61.07      33.51
Pro-based KBE   46.78     60.81      35.61
Loc-based KBE   49.28     65.56      35.81
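For readers unfamiliar with how an NER F1 score is obtained, here is a minimal sketch of entity-level precision, recall, and F1 with a per-type breakdown. It assumes strict matching on (start, end, type); the slide does not say which matching criterion was used. The mentions and the helper name `prf1` are hypothetical.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall, and F1 under strict matching.

    A predicted mention counts as correct only if its (start, end, type)
    tuple appears in the gold set. This criterion is an assumption.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical mentions as (start_token, end_token, entity_type).
gold = {(0, 2, "Bacteria"), (5, 7, "Habitat"), (9, 10, "Habitat")}
pred = {(0, 2, "Bacteria"), (5, 6, "Habitat"), (9, 10, "Habitat"), (12, 13, "Bacteria")}

overall = prf1(gold, pred)
per_type = {
    etype: prf1(
        {g for g in gold if g[2] == etype},
        {p for p in pred if p[2] == etype},
    )
    for etype in ("Bacteria", "Habitat")
}
```

Splitting by type, as in `per_type`, is what lets a chart report Bacteria and Habitat scores separately from the overall score.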

Relation Extraction
More tagged data and making full use of domain knowledge.
[Figure: model diagram over input words w1, w2, ..., wn. Labels: word embedding, loc embedding, entity order, hidden state, attention α1, attention α2, softmax layer. The example enzyme subgraph appears in the background.]
Experimental results (%):
Methods     Precision   Recall   F1
VERSE       51.0        61.5     55.8
Ours        48.3        60.5     53.8
TurkuNLP    62.3        44.8     52.1
LIMSI       38.8        64.6     48.5
HK          59.9        39.2     47.4
WhuNlpRE    55.9        40.7     47.1
DUTIR       56.6        38.2     45.6
Annotation on the results: Manual Feature Engineering
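The model diagram names attention weights and a softmax layer without giving formulas. As a rough illustration only, the sketch below shows generic dot-product attention pooling followed by a softmax classifier. It is not the authors' architecture: the query vector, the class weights, and the use of concatenated word and location embeddings as stand-ins for hidden states are all assumptions, and a real model would run the embeddings through an encoder first.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(hidden_states, query):
    """Return (weights, context).

    weights_i = softmax_i(h_i . query); context = sum_i weights_i * h_i.
    """
    weights = softmax([dot(h, query) for h in hidden_states])
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


def classify(context, class_weights):
    """Softmax layer: one (hypothetical) weight vector per relation label."""
    return softmax([dot(w, context) for w in class_weights])


# Hypothetical token vectors: word embedding concatenated with a location embedding.
word_emb = [[0.1, 0.3], [0.7, 0.2], [0.4, 0.9]]
loc_emb = [[1.0], [0.0], [-1.0]]
hidden = [w + l for w, l in zip(word_emb, loc_emb)]  # list concatenation

query = [0.5, 0.5, 0.0]  # assumed attention query vector
alpha, context = attention_pool(hidden, query)
probs = classify(context, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Tokens whose vectors align with the query get larger weights, so the pooled `context` is dominated by the words the model treats as most relevant to the relation.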

Knowledge Graph Completion
Overcoming the imbalance between heads and tails.
$h_a = M_t h$, $r_a = M_t r$
$f_t(h, r) = \| h_a + r_a - t \|^2_{L_1/L_2}$
[Figure: two charts, "Hit of Prediction" and "MeanRank of Prediction", each showing raw and filt results for Heads Prediction, Tails Prediction, and Relations Prediction. Methods compared: TransE, TransH, TransR, TransD, TransSparse, TransMT. The example enzyme subgraph appears in the background.]
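A small sketch can make the scoring function and the raw/filt metrics concrete. It reads the slide's formula as $h_a = M_t h$, $r_a = M_t r$, $f_t(h, r) = \| h_a + r_a - t \|^2$, which is an assumption about garbled slide text. How TransMT actually builds $M_t$ is not stated, so the matrix here is a hypothetical placeholder (the identity), and all embeddings and names are made up. The raw and filtered ranks follow the usual convention in the knowledge graph embedding literature.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def score(h, r, t, M_t, p=2):
    """Squared Lp distance ||M_t h + M_t r - t||_p^2 (assumed form).

    Lower scores mean a more plausible triple.
    """
    h_a = matvec(M_t, h)
    r_a = matvec(M_t, r)
    diff = [a + b - c for a, b, c in zip(h_a, r_a, t)]
    norm = sum(abs(d) ** p for d in diff) ** (1.0 / p)
    return norm ** 2


def rank_of(true_item, scores, known_true=()):
    """Rank of true_item among candidates (1 = best, lower score is better).

    Raw setting: leave known_true empty. Filtered setting: pass the other
    candidates already known to be correct so they do not count as errors.
    """
    skip = set(known_true) - {true_item}
    target = scores[true_item]
    return 1 + sum(1 for c, s in scores.items() if c not in skip and s < target)


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def hits_at(ranks, k):
    """Fraction of test cases whose rank is at most k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical 2-d embeddings. With an identity M_t the score reduces to a
# TransE-style distance; a real M_t would differ per tail.
identity = [[1.0, 0.0], [0.0, 1.0]]
h, r = [1.0, 0.0], [0.0, 1.0]
tails = {"tail_a": [1.0, 1.0], "tail_b": [1.0, 0.5], "tail_c": [3.0, 3.0]}
scores = {name: score(h, r, t, identity) for name, t in tails.items()}

raw = rank_of("tail_b", scores)                           # tail_a outranks it
filt = rank_of("tail_b", scores, known_true=["tail_a"])   # tail_a is also correct
```

When a head-relation pair has many correct tails, the raw rank penalizes the model for ranking other correct tails first, which is why results are reported in both settings.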

Conclusion & Discussion
[Figure labels: Text, Named Entity Recognition, Relation Extraction, Knowledge Graph Completion, Entities, Relations. The example enzyme subgraph appears in the background.]
Future Work: more modalities, more triplets, more domains

Knowledge Graph & Scientific Discoveries
[Figure labels, in the order they appear:]
Multi-Source Heterogeneous Microbiology Data: Enzyme, Protein, Gene
KGBuilder
Living environment Data; Structure & Function Data; Bio/Chem Characteristics; Pharmacology Characteristics
Lesion Inducements; Lesion Causes; Lesion Trends; Medicine Discovery
Applications: Interaction Query, Literature Analysis, Path Discovery

Supported by the Project on Scientific Big Data System
Background: The Scientific Big Data System is funded by the 'National Key R&D Plan: Cloud Computing and Big Data'. It is led by the Chinese Academy of Sciences jointly with 16 universities and institutions.
Goals:
Astronomy: efficient storage and analysis of astronomical catalogs with 100 billion rows
High-energy physics: high-efficiency storage and retrieval of trillion-event data
Bioscience: retrieval of multi-level correlations in RDF knowledge graphs with 10 billion edges
Accelerating scientific discovery

yizhang1208@ruc.edu.cn