Special Topics in Computer Science


Special Topics in Computer Science: NLP in a Nutshell
CS492B, Spring Semester 2009
Speaker: Hee Jin Lee
Professor: Jong C. Park
Computer Science Department, Korea Advanced Institute of Science and Technology

TEXT MINING APPLICATIONS: INFORMATION EXTRACTION

Contents
- Information Extraction: What? and Why?
- Approaches to Information Extraction
- Information Extraction Challenges
- Application: Literature-based Discovery
- Conclusion

Information Extraction (IE)
What is done by IE?
- Take a natural language text from a document source, and extract the essential facts about one or more predefined fact types.
- Represent each fact with a template whose slots are filled on the basis of what is found in the text.
Example:
  "We have previously shown that ETS1 can activate GM-CSF in Jurkat T cells."
  Activate(ETS1, GM-CSF)
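A minimal sketch of template filling for the example sentence above may help make the idea concrete. It assumes a single fact type and one hand-written regular expression; the `Fact` class, the `ACTIVATE` pattern and the function name are illustrative assumptions, not part of any system described in the lecture, and the gene name is written as GM-CSF.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A filled template: one predicate plus its two argument slots."""
    predicate: str
    agent: str
    target: str

    def __str__(self):
        return f"{self.predicate}({self.agent}, {self.target})"


# Hypothetical pattern for one predefined fact type ("X can activate Y").
# Names are assumed to start with a capital letter and may contain hyphens.
ACTIVATE = re.compile(
    r"(?P<agent>[A-Z][\w-]*) can activate (?P<target>[A-Z][\w-]*)"
)


def extract_activation(sentence):
    """Return a filled Activate template, or None if the pattern is absent."""
    match = ACTIVATE.search(sentence)
    if match is None:
        return None
    return Fact("Activate", match.group("agent"), match.group("target"))


sentence = ("We have previously shown that ETS1 can activate GM-CSF "
            "in Jurkat T cells.")
print(extract_activation(sentence))  # Activate(ETS1, GM-CSF)
```

A real IE system would cover many fact types and many surface forms of each; the point here is only the mapping from free text to a fixed set of slots.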

Information Extraction (IE)
IE vs. IR
Information Retrieval (IR)
- Returns documents.
- Is a classification task (each document is relevant/not relevant to a query).
- Can be done without reference to syntax (treating the query and the documents as merely a "bag of words").
Information Extraction (IE)
- Returns facts.
- Is an application of natural language processing, involving the analysis of text and the synthesis of a structured representation.
- Is based on syntactic analysis and indeed semantic analysis.

IE in Biology and Biomedicine
- A large number of papers are published in the domain of biology and biomedicine.
[Figure: total citations in MEDLINE; vertical axis from 0 to 18,000,000]
- Experts cannot check all the relevant papers.
- We can help them with automated tools.

Approaches to IE
- Pattern matching approaches
- Basic context-free grammar approaches
- Full parsing approaches
- Probability-based parsing
- Mixed syntax-semantics approaches
- Sublanguage-driven information extraction
- Ontology-driven information extraction
IE methods have evolved from simpler methods like pattern matching to higher-level NLP techniques such as full parsing.

Pattern Matching Approaches
Martin et al. (2004)
- Extract protein-protein interactions
- Use a number of dictionaries
  - Protein names and their synonyms
  - Protein interaction verbs and their synonyms
  - Common strings to identify unknown proteins (e.g., protein, kinase)
- Sample pattern: ($VarGene $Verb (the)? $VarGene)
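The sample pattern above can be sketched as an ordinary regular expression built from small dictionaries. This is only an illustration under stated assumptions: the two word lists and the test sentences are toy data invented here, and the handling of unknown proteins via common strings is left out.

```python
import re

# Hypothetical miniature dictionaries; a real system would load much larger
# lists of protein names, synonyms and interaction verbs.
PROTEINS = ["ETS1", "GM-CSF", "RAD51", "BRCA2"]
VERBS = ["activates", "binds", "inhibits", "phosphorylates"]


def alternation(terms):
    """Build a regex alternation, longest term first so long names win."""
    ordered = sorted(terms, key=len, reverse=True)
    return "|".join(re.escape(term) for term in ordered)


# Regex counterpart of the pattern ($VarGene $Verb (the)? $VarGene).
PATTERN = re.compile(
    rf"\b(?P<a>{alternation(PROTEINS)})\s+"
    rf"(?P<verb>{alternation(VERBS)})\s+"
    rf"(?:the\s+)?"
    rf"(?P<b>{alternation(PROTEINS)})\b"
)


def find_interactions(text):
    """Return (protein, verb, protein) triples matched in the text."""
    return [(m["a"], m["verb"], m["b"]) for m in PATTERN.finditer(text)]


text = "RAD51 binds BRCA2. ETS1 activates the GM-CSF promoter."
print(find_interactions(text))
```

Surface patterns like this are cheap to run but miss any sentence where other words intervene between the two names and the verb, which is one motivation for the parsing-based approaches.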

Full Parsing Approaches: BioIE
Kim and Park (2004)
- Extract general biological interactions
- Start with identifying keyword verbs and their arguments using pattern matching
- Full parsing is used to validate the pattern matching result
- Performance on corpora of 1,505 abstracts

Full Parsing Approaches: BioIE
System flow
[Figure: BioIE system flow]
- NP matching is done in a bidirectional way using heuristic rules.

Full Parsing Approaches: BioIE
Example

Full Parsing Approaches: RelEx
Fundel et al. (2007)
- Extract gene/protein interactions
- Start with identifying gene/protein names
- Does not identify the kind of interaction
  - Relation extraction rather than information extraction
- Performance (Recall/Precision/F-measure)
  - 85/79/82 on the LLL challenge data set
  - 78/79/78 on a 50-abstract subset of the Human Protein Reference Database
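The F-measure in the performance figures above is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values; the function name and the `beta` parameter are choices made for this illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported for RelEx, in percent.
lll = f_measure(precision=79, recall=85)
hprd = f_measure(precision=79, recall=78)
print(round(lll), round(hprd))  # 82 78
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot reach a high F-measure by trading all of its precision for recall or the reverse.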

Full Parsing Approaches: RelEx
System overview
- Stanford Lexicalized Parser
- ProMiner NER system
- fnTBL NP chunker
- Extract paths connecting pairs of proteins from dependency parse trees
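The path-extraction step can be illustrated with a breadth-first search over a dependency tree treated as an undirected graph. This is a sketch of the general idea only, not RelEx itself: the dependency edges are written by hand for the made-up sentence "ProtA strongly activates the expression of ProtB", and words are used directly as node identifiers, which assumes no word occurs twice.

```python
from collections import deque

# Hand-written (head, dependent) edges for the hypothetical sentence
# "ProtA strongly activates the expression of ProtB".
EDGES = [
    ("activates", "ProtA"),
    ("activates", "strongly"),
    ("activates", "expression"),
    ("expression", "the"),
    ("expression", "of"),
    ("of", "ProtB"),
]


def build_graph(edges):
    """Turn head-dependent pairs into an undirected adjacency map."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(EDGES)
print(shortest_path(graph, "ProtA", "ProtB"))
```

The path skips modifiers such as "strongly" and "the", leaving the words that actually link the two proteins; rules can then be applied to that path to decide whether it expresses an interaction.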

Full Parsing Approaches: RelEx
Example
- Interacting protein pairs: (sigmab, yvyd), (Sigma H, yvyd)

IE Challenges
- To compare the performance of different approaches, common standards or shared evaluation criteria are needed.
- IE challenges
  - Propose tasks
  - Develop and distribute large enough training and test datasets

BioCreAtIvE Challenge
Critical Assessment of Information Extraction systems in Biology
http://biocreative.sourceforge.net
IE task in BioCreative 2 (2006), by sub-task:
- Protein interaction article sub-task (IAS)
  Description: detection of protein interaction relevant articles
  Highest F-score: 0.78 (P: 0.70, R: 0.88)
- Protein interaction pairs sub-task (IPS)
  Description: extraction and normalization of protein interaction pairs
  Highest F-score: 0.30 (P: 0.37, R: 0.33)
- Protein interaction sentence sub-task (ISS)
  Description: retrieval of the actual text passages that provide evidence for protein interactions
  Highest score: P: 0.19
- Protein interaction method sub-task (IMS)
  Description: retrieval of the interaction detection method
  Highest F-score: 0.65 (P: 0.59, R: 0.85)

Literature-based Discovery (LBD)
- Literature-based discovery: a method for automatically generating hypotheses for scientific research by finding overlooked implicit connections in the research literature.

LBD: a Simple Scenario
- Primary concepts
  - Diseases
  - Drugs
  - Symptoms
- Relations
  - Cause(Disease, Symptom)
  - Decrease(Drug, Symptom)
- Discoveries
  - Treat(Drug, Disease)
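The scenario above amounts to a single inference rule: if a disease causes a symptom and a drug decreases that same symptom, propose that the drug may treat the disease. The sketch below applies that rule to relation tuples; the function name and the placeholder concept names are assumptions made for illustration.

```python
def hypothesize_treatments(causes, decreases, known_treatments=frozenset()):
    """Join Cause(disease, symptom) with Decrease(drug, symptom).

    Returns (drug, disease) pairs that are not already known treatments.
    """
    hypotheses = set()
    for disease, symptom in causes:
        for drug, reduced_symptom in decreases:
            if symptom == reduced_symptom:
                pair = (drug, disease)
                if pair not in known_treatments:
                    hypotheses.add(pair)
    return hypotheses


# Placeholder relations, standing in for the output of an IE system.
causes = {("DiseaseA", "SymptomS"), ("DiseaseB", "SymptomT")}
decreases = {("DrugD", "SymptomS"), ("DrugE", "SymptomU")}

print(hypothesize_treatments(causes, decreases))  # {('DrugD', 'DiseaseA')}
```

The output is a hypothesis only; as the scenario itself implies, it still has to be confirmed experimentally.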

LBD: a Simple Scenario
- Use an IE system to extract relations from the literature
  - Cause(Raynaud's disease, blood viscosity reduction)
  - Cause(Raynaud's disease, platelet aggregation reduction)
  - Increase(Fish oil, blood viscosity)
  - Increase(Fish oil, platelet aggregation)
- Hypothesize a new relation: a discovery!
  - Treat(Fish oil, Raynaud's disease)
- Confirm with laboratory methods

LBD: a Real Example
Hristovski et al. (2006)
- Their discovery pattern

LBD: a Real Example
Their method
- Start with a disease X in mind
- Find physiological concepts Ys that frequently co-occur with the disease X
- Extract relations between X and the Ys
- Find concepts Zs that co-occur with the Ys
- Extract relations between the Zs and the Ys
- Make hypotheses using the discovery pattern
BITOLA, BioMedLee and SemRep are used.
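The co-occurrence steps of the method above (X to Ys, then Ys to Zs) can be sketched on toy data. This is not how BITOLA, BioMedLee or SemRep work internally; it assumes each document has already been reduced to a set of concept labels, it skips the relation-extraction steps entirely, and all names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def cooccurring(concept, documents, min_count=1):
    """Concepts appearing with `concept` in at least `min_count` documents."""
    counts = Counter()
    for doc in documents:
        if concept in doc:
            counts.update(doc - {concept})
    return {c for c, n in counts.items() if n >= min_count}


def candidate_z(x, documents, min_count=1):
    """Map each candidate Z to the intermediate Ys that link it to X.

    A candidate Z co-occurs with some Y but is neither X nor one of the Ys,
    so it is never seen together with X in the same document.
    """
    ys = cooccurring(x, documents, min_count)
    candidates = {}
    for y in ys:
        for z in cooccurring(y, documents, min_count):
            if z != x and z not in ys:
                candidates.setdefault(z, set()).add(y)
    return candidates


# Toy corpus: each document is the set of concepts mentioned in it.
documents = [
    {"X", "Y1"},
    {"X", "Y2"},
    {"Y1", "Z1"},
    {"Y2", "Z1"},
    {"Y2", "Z2"},
]

print(candidate_z("X", documents))
```

Candidates supported by several intermediate concepts (here Z1, linked through both Y1 and Y2) are natural ones to examine first, although co-occurrence alone says nothing about the direction or type of the relation.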

LBD: a Real Example
What they found
- Treat(eicosapentaenoic acid, Raynaud's)
- Treat(Treatment for diabetes, Raynaud's)

Conclusion
- Information extraction extracts structured information from unstructured text.
- IE methods have evolved from simpler methods to higher-level NLP techniques.
- Challenges provide gold-standard datasets for evaluation.
- IE systems can be used for literature-based discovery.

References
- John McNaught, William J. Black, Information Extraction, Text Mining for Biology and Biomedicine, 2006.
- Martin, E. P., et al., Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Articles, Knowledge Exploration in Life Science Informatics, 2004.
- Kim, J., J. Park, BioIE: Retargetable information extraction and ontological annotation of biological interactions from the literature, Journal of Bioinformatics and Computational Biology, vol. 2, no. 3, 551-568, 2004.
- Katrin Fundel, Robert Kuffner, Ralf Zimmer, RelEx: Relation extraction using dependency parse trees, Bioinformatics, vol. 23, no. 3, 2007.
- Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, Kevin B. Cohen, Frontiers of biomedical text mining: current progress, Briefings in Bioinformatics, vol. 8, no. 5, 358-375, 2007.
- Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, Borut Peterlin, Exploiting Semantic Relations for Literature-Based Discovery, AMIA, 2006.

Thank you

Raynaud's Disease
Raynaud's disease (RAY-noz) is a vascular disorder that affects blood flow to the extremities (the fingers, toes, nose and ears) when exposed to cold temperatures or in response to psychological stress. It is named for Maurice Raynaud (1834-1881), a French physician who first described it in 1862.

Huntington Disease
An autosomal dominant inherited neurodegenerative disorder that is characterized by the insidious progressive development of mood disturbances, behavioral changes, involuntary choreiform movements and cognitive impairments. Onset is most commonly in adulthood, with a typical duration of 15-20 years before premature death.