XLDB2018 KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building Yi Zhang, Xiaofeng Meng WAMDM@RUC 5/3/2018
Knowledge Graph: What is a knowledge graph?
[Figure: knowledge graphs arranged by language (Chinese, non-Chinese) and by domain (open domain, specific domain). Labeled examples: UMLS, Microbiology, TCM KG, Business KG, Ethic Chinese KG, Fact.]
Microbiology Knowledge Graph
Characteristics: more symbol words and ID-like entities; entities with long names; the same head-relation pair links to various tails.
[Figure: example subgraph of enzyme entries (EnzymeNode 1.5.1.17, 1.3.99.8, 1.1.1.139) with name, othername, and substrate relations. Literal values shown include "ADH", "ALPDH", "alanopine dehydrogenase", "NAD+", "Oxidoreductases", "Acting on the CH-NH group of donors", "With NAD+ or NADP+ as acceptor", "Deleted entry", "created 1983, modified 1986", "created 1972, deleted 1978", and ID-like values such as "10225379", "10225380", "14053246", "bpy:bphyt_2183", and "amim:mim_c3 4440".]
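The third characteristic on this slide, one head-relation pair linking to several tails, can be made concrete with a small sketch. The triples below are assumptions for illustration, loosely modeled on the labels in the example figure (the exact edges on the slide are not recoverable), and `tails_per_pair` is a hypothetical helper name, not part of KGBuilder.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples, assumed for illustration only.
triples = [
    ("1.5.1.17", "name", "alanopine dehydrogenase"),
    ("1.5.1.17", "othername", "ADH"),
    ("1.5.1.17", "othername", "ALPDH"),
    ("1.5.1.17", "substrate", "NAD+"),
]

def tails_per_pair(triples):
    """Map each (head, relation) pair to the set of tails it links to."""
    tails = defaultdict(set)
    for head, relation, tail in triples:
        tails[(head, relation)].add(tail)
    return dict(tails)

pairs = tails_per_pair(triples)

# Keep only the pairs that link to more than one tail.
multi_tail = {pair: sorted(t) for pair, t in pairs.items() if len(t) > 1}
print(multi_tail)
```

Pairs that survive the filter are the one-to-many cases that make tail prediction harder than head prediction in graphs of this kind.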
Knowledge Graph Building
[Figure labels: Open Domain, Specific Domain, Auto-DB to Knowledge, Auto-Text to Knowledge.]
Overview of KGBuilder
Key Technologies of KGBuilder
Named Entity Recognition: Naïve KBE, Pro-based KBE, Loc-based KBE
Relation Extraction: Distant Supervision, Intra-Sentence, Cross-Sentence
Knowledge Graph Completion: TransMT, TransMT v, TransMT s
Named Entity Recognition: making full use of domain knowledge.
F1 Score (%) for NER
Method          Overall   Bacteria   Habitat
Baseline        42.00     52.24      35.56
Naïve KBE       45.08     61.07      33.51
Pro-based KBE   46.78     60.81      35.61
Loc-based KBE   49.28     65.56      35.81
Relation Extraction: more tagged data and making full use of domain knowledge.
[Figure: model diagram. Labels: entity, entity order, w1 w2 ... wn, word embedding, loc embedding, hidden state, attention α1, attention α2, softmax layer.]
Experimental Results (%)
Methods     Precision   Recall   F1
VERSE       51.0        61.5     55.8
Ours        48.3        60.5     53.8
TurkuNLP    62.3        44.8     52.1
LIMSI       38.8        64.6     48.5
HK          59.9        39.2     47.4
WhuNlpRE    55.9        40.7     47.1
DUTIR       56.6        38.2     45.6
(The slide also carries the label "Manual Feature Engineering" next to the table.)
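As a quick consistency check on the precision, recall, and F1 values reported on this slide, F1 can be recomputed as the harmonic mean of precision and recall. This is a minimal sketch; `f1_score` and `reported` are names introduced here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) as listed on the slide.
reported = {
    "VERSE": (51.0, 61.5, 55.8),
    "Ours": (48.3, 60.5, 53.8),
    "TurkuNLP": (62.3, 44.8, 52.1),
    "LIMSI": (38.8, 64.6, 48.5),
    "HK": (59.9, 39.2, 47.4),
    "WhuNlpRE": (55.9, 40.7, 47.1),
    "DUTIR": (56.6, 38.2, 45.6),
}

# Absolute gap between the recomputed and the reported F1 for each method.
gaps = {m: abs(f1_score(p, r) - f1) for m, (p, r, f1) in reported.items()}
for method, gap in gaps.items():
    print(f"{method}: recomputed F1 differs from reported by {gap:.2f}")
```

Every recomputed value lands within about 0.1 of the reported F1, which is consistent with the inputs having been rounded to one decimal place.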
Knowledge Graph Completion: overcoming the imbalance between heads and tails.
$h_a = M_t h$, $r_a = M_t r$, $f_t(h, r) = \|h_a + r_a - t\|_{L_1/L_2}^{2}$
[Figure: two bar charts, Hit of Prediction and MeanRank of Prediction, each with raw and filt results for Heads Prediction, Tails Prediction, and Relations Prediction. Methods compared: TransE, TransH, TransR, TransD, TransSparse, TransMT.]
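The formula on this slide can be read as a translation-style score in which the head and the relation are both projected by a tail-specific matrix before being compared with the tail. The sketch below follows that reading under stated assumptions: the minus sign before t, the plain (unsquared) L1 norm, the toy vectors, and the single identity matrix shared by all candidate tails are all choices made here for illustration, and `score` and `rank_of` are hypothetical names, not KGBuilder's API. It also shows how a raw rank differs from a filtered rank, the two settings named in the charts.

```python
def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def score(h, r, t, m_t):
    """Assumed score ||M_t h + M_t r - t|| under the L1 norm; lower is better."""
    h_a = matvec(m_t, h)
    r_a = matvec(m_t, r)
    return sum(abs(ha + ra - ti) for ha, ra, ti in zip(h_a, r_a, t))

def rank_of(true_tail, scores, known_tails=()):
    """Rank of the true tail among all candidates (1 is best).

    With `known_tails` empty this is the raw rank. Passing the other tails
    already known to be correct removes them from the competition, which
    gives the filtered rank.
    """
    target = scores[true_tail]
    better = [c for c, s in scores.items()
              if s < target and c != true_tail and c not in known_tails]
    return 1 + len(better)

# Toy data (assumed): 2-dimensional embeddings and an identity projection.
identity = [[1, 0], [0, 1]]
h, r = [1, 0], [0, 1]
candidates = {"t1": [1, 1], "t2": [1, 2], "t3": [3, 3]}
scores = {name: score(h, r, vec, identity) for name, vec in candidates.items()}

raw_rank = rank_of("t2", scores)                      # t1 scores better than t2
filt_rank = rank_of("t2", scores, known_tails={"t1"})  # t1 is also a correct tail
print(scores, raw_rank, filt_rank)
```

Mean rank averages these ranks over all test triples, and a hit at k counts the fraction of test triples whose rank is at most k, so filtering can only improve both numbers.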
Conclusion & Discussion
[Figure: pipeline from Text through Named Entity Recognition (Entities) and Relation Extraction (Relations) to Knowledge Graph Completion (More triplets).]
Future Work: more modalities, more domains.
Knowledge Graph & Scientific Discoveries
[Figure: multi-source heterogeneous microbiology data feeds KGBuilder, which supports discovery and applications. Data labels: Enzyme, Protein, Gene, Living environment Data, Structure & Function Data, Bio/Chem Characteristics, Pharmacology Characteristics. Discovery labels: Lesion Inducements, Lesion Causes, Lesion Trends, Medicine Discovery. Applications: Interaction Query, Literature Analysis, Path Discovery.]
Supported by the Project on Scientific Big Data System
Background: The Scientific Big Data System is funded by the 'National Key R&D Plan: Cloud Computing and Big Data'. It is led by the Chinese Academy of Sciences jointly with 16 universities and institutions.
Goals:
Astronomy: efficient storage and analysis of 100-billion-line astronomical catalogs
High-energy physics: high-efficiency storage and retrieval of trillion-event data
Bioscience: retrieval of multi-level correlations in RDF knowledge graphs with 10 billion edges
Overall aim: accelerating scientific discovery
yizhang1208@ruc.edu.cn