Special Topics in Computer Science
  NLP in a Nutshell
  CS492B, Spring Semester 2009
  Speaker: Hee Jin Lee
  Professor: Jong C. Park
  Computer Science Department, Korea Advanced Institute of Science and Technology
TEXT MINING APPLICATIONS: INFORMATION EXTRACTION
Contents
  Information Extraction: What? and Why?
  Approaches to Information Extraction
  Information Extraction Challenges
  Application: Literature-based Discovery
  Conclusion
Information Extraction (IE)
  What is done by IE?
    Take a natural language text from a document source, and extract essential facts about one or more predefined fact types
    Represent each fact with a template whose slots are filled on the basis of what is found in the text
  Example
    "We have previously shown that ETS1 can activate GM-CSF in Jurkat T cells."
    Activate(ETS1, GM-CSF)
Information Extraction (IE)
  IE vs. IR
  Information retrieval (IR)
    Returns documents.
    Is a classification task (each document is relevant/not relevant to a query).
    Can be done without reference to syntax (treating the query and indeed the documents as merely a "bag of words").
  Information extraction (IE)
    Returns facts.
    Is an application of natural language processing, involving the analysis of text and synthesis of a structured representation.
    Is based on syntactic analysis and semantic analysis.
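The bag-of-words point can be illustrated with a minimal sketch. The second sentence and the tuple representation of a fact are assumptions made for illustration only; they do not come from any particular IR or IE system.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

# The second sentence is a hypothetical reversal of the first.
s1 = "ETS1 activates GM-CSF"
s2 = "GM-CSF activates ETS1"

# The two bags are identical although the sentences assert opposite facts.
same_bag = bag_of_words(s1) == bag_of_words(s2)

# An IE-style structured representation keeps the argument order of the relation.
fact1 = ("Activate", "ETS1", "GM-CSF")
fact2 = ("Activate", "GM-CSF", "ETS1")
```

A bag-of-words model cannot tell the two sentences apart, whereas the structured facts remain distinct.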
IE in Biology and Biomedicine
  A large number of papers are published in the domain of biology and biomedicine.
  [Figure: total citations in MEDLINE; vertical axis from 0 to 18,000,000 citations]
  Experts cannot check all the relevant papers.
  We can help them with automated tools.
Approaches to IE
  Pattern matching approaches
  Basic context-free grammar approaches
  Full parsing approaches
  Probability-based parsing
  Mixed syntax-semantics approaches
  Sublanguage-driven information extraction
  Ontology-driven information extraction
  IE methods have evolved from simpler methods like pattern matching to higher-level NLP techniques such as full parsing.
Pattern Matching Approaches
  Martin et al. (2004)
  Extract protein-protein interactions
  Use a number of dictionaries
    Protein names and their synonyms
    Protein interaction verbs and their synonyms
    Common strings to identify unknown proteins (e.g., protein, kinase)
  Sample pattern
    ($VarGene $Verb (the)? $VarGene)
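A minimal sketch of how such a dictionary-driven pattern could be compiled into a regular expression. The dictionary entries, the function names, and the test sentence are hypothetical; the actual dictionaries and matching engine of Martin et al. are not reproduced here.

```python
import re

# Hypothetical dictionary entries, for illustration only.
PROTEIN_NAMES = ["ETS1", "GM-CSF"]
INTERACTION_VERBS = ["activate", "activates", "binds", "inhibits"]

def alternation(words):
    """Build a regex alternation, longest entry first so longer names win."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)

GENE = alternation(PROTEIN_NAMES)
VERB = alternation(INTERACTION_VERBS)

# Regex counterpart of the sample pattern ($VarGene $Verb (the)? $VarGene).
PATTERN = re.compile(
    rf"\b(?P<agent>{GENE})\s+(?P<verb>{VERB})\s+(?:the\s+)?(?P<target>{GENE})\b"
)

def extract_interactions(sentence):
    """Return (verb, agent, target) triples for every match in the sentence."""
    return [
        (m["verb"], m["agent"], m["target"])
        for m in PATTERN.finditer(sentence)
    ]

found = extract_interactions("We found that ETS1 activates the GM-CSF gene.")
missed = extract_interactions("ETS1 can activate GM-CSF in Jurkat T cells.")
```

The second call returns no match because the auxiliary "can" sits between the protein name and the verb, which shows how brittle a purely surface-level pattern can be.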
Full Parsing Approaches: BioIE
  Kim and Park (2004)
  Extract general biological interactions
  Start with identifying keyword verbs and their arguments using pattern matching
  Full parsing is used to validate the pattern matching result
  Performance on corpora of 1,505 abstracts
Full Parsing Approaches: BioIE
  System flow
  NP matching is done in a bidirectional way using heuristic rules.
Full Parsing Approaches: BioIE
  Example
Full Parsing Approaches: RelEx
  Fundel et al. (2007)
  Extract gene/protein interactions
  Start with identifying gene/protein names
  Does not identify the kind of interaction
    Relation extraction rather than information extraction
  Performance (Recall/Precision/F-measure)
    85/79/82 on the LLL challenge data set
    78/79/78 on a 50-abstract subset of the Human Protein Reference Database
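The F-measure values above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the slide does not state which variant was used); the function and variable names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recall/precision figures reported for RelEx, expressed as fractions.
lll = f_measure(precision=0.79, recall=0.85)
hprd = f_measure(precision=0.79, recall=0.78)
```

Rounded to two decimals, `lll` and `hprd` give 0.82 and 0.78, matching the reported 82 and 78.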
Full Parsing Approaches: RelEx
  System overview
    Stanford Lexicalized Parser
    ProMiner NER system
    fntbl NP chunker
  Extract paths connecting pairs of proteins from dependency parse trees
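The path-extraction step can be sketched as a shortest-path search over the dependency tree. The edge list below is a hand-built, hypothetical parse of "ETS1 can activate GM-CSF", not output of the Stanford parser, and the breadth-first search is a generic stand-in rather than the RelEx implementation.

```python
from collections import deque

def dependency_path(edges, source, target):
    """Shortest token path between two tokens, treating edges as undirected."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the two tokens are not connected

# Hypothetical (head, dependent) edges for "ETS1 can activate GM-CSF".
edges = [
    ("activate", "ETS1"),    # subject
    ("activate", "can"),     # auxiliary
    ("activate", "GM-CSF"),  # object
]

path = dependency_path(edges, "ETS1", "GM-CSF")
```

Here the path runs through the verb "activate"; a system in this style could then inspect the tokens on the path to decide whether the two proteins are related.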
Full Parsing Approaches: RelEx
  Example
  Interacting protein pairs
    (sigmab, yvyd)
    (Sigma H, yvyd)
IE Challenges
  To compare the performance of different approaches, common standards or shared evaluation criteria are needed
  IE challenges
    Propose tasks
    Develop and distribute large enough training and test datasets
BioCreAtIvE Challenge
  Critical Assessment of Information Extraction systems in Biology
  http://biocreative.sourceforge.net
  IE tasks in BioCreative 2 (2006)
    Task | Description | Highest F-score
    Protein interaction article sub-task (IAS) | Detection of protein interaction relevant articles | 0.78 (P: 0.70, R: 0.88)
    Protein interaction pairs sub-task (IPS) | Extraction and normalization of protein interaction pairs | 0.30 (P: 0.37, R: 0.33)
    Protein interaction sentence sub-task (ISS) | Retrieval of actual text passages that provide evidence for protein interactions | P: 0.19
    Protein interaction method sub-task (IMS) | Retrieval of the interaction detection method | 0.65 (P: 0.59, R: 0.85)
Literature-based Discovery (LBD)
  Literature-based discovery
    A method for automatically generating hypotheses for scientific research by finding overlooked implicit connections in the research literature
LBD: a Simple Scenario
  Primary concepts
    Diseases
    Drugs
    Symptoms
  Relations
    Cause(Disease, Symptom)
    Decrease(Drug, Symptom)
  Discoveries
    Treat(Drug, Disease)
LBD: a Simple Scenario
  Use an IE system to extract relations from the literature
    Cause(Raynaud's disease, blood viscosity reduction)
    Cause(Raynaud's disease, platelet aggregation reduction)
    Increase(Fish oil, blood viscosity)
    Increase(Fish oil, platelet aggregation)
  Hypothesize a new relation: a discovery!
    Treat(Fish oil, Raynaud's disease)
  Confirm with laboratory methods
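The inference step of this scenario can be sketched as a small rule over extracted relations. The `Relation` type, the function name, and the convention of reading "X reduction" as a lowering of quantity X are assumptions made for illustration; the relation instances are the ones listed on the slide.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    predicate: str
    subject: str
    obj: str

SUFFIX = " reduction"

def hypothesize_treatments(relations):
    """Propose Treat(drug, disease) when a drug increases a quantity
    that the disease is said to reduce."""
    reduced_by = {}  # quantity -> diseases that cause its reduction
    for r in relations:
        if r.predicate == "Cause" and r.obj.endswith(SUFFIX):
            quantity = r.obj[: -len(SUFFIX)]
            reduced_by.setdefault(quantity, set()).add(r.subject)

    hypotheses = set()
    for r in relations:
        if r.predicate == "Increase":
            for disease in reduced_by.get(r.obj, ()):
                hypotheses.add(Relation("Treat", r.subject, disease))
    return hypotheses

extracted = [
    Relation("Cause", "Raynaud's disease", "blood viscosity reduction"),
    Relation("Cause", "Raynaud's disease", "platelet aggregation reduction"),
    Relation("Increase", "Fish oil", "blood viscosity"),
    Relation("Increase", "Fish oil", "platelet aggregation"),
]

hypotheses = hypothesize_treatments(extracted)
```

Both intermediate quantities lead to the same hypothesis, so the resulting set contains a single Treat relation between fish oil and Raynaud's disease.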
LBD: a Real Example
  Hristovski et al. (2006)
  Their discovery pattern
LBD: a Real Example
  Their method
    Start with a disease X in mind
    Find physiological concepts Ys that frequently co-occur with the disease X
    Extract relations between X and the Ys
    Find concepts Zs that co-occur with the Ys
    Extract relations between the Zs and the Ys
    Make hypotheses using the discovery pattern
  BITOLA, BioMedLee, and SemRep are used.
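The co-occurrence steps of the method can be sketched as follows. This is a toy illustration, not BITOLA: documents are modelled as plain sets of concept labels, the concept names and the `min_count` threshold are hypothetical, and excluding Zs that already co-occur with X is an assumption of this sketch. The relation-extraction steps are omitted.

```python
from collections import Counter

def cooccurring(concept, documents, min_count=1):
    """Concepts that share at least min_count documents with the given concept."""
    counts = Counter()
    for doc in documents:
        if concept in doc:
            counts.update(c for c in doc if c != concept)
    return {c for c, n in counts.items() if n >= min_count}

def candidate_zs(x, documents, min_count=1):
    """Map each candidate Z to the intermediate Ys that link it to X."""
    ys = cooccurring(x, documents, min_count)
    candidates = {}
    for y in ys:
        for z in cooccurring(y, documents, min_count):
            # Assumption: skip X itself and concepts already linked to X.
            if z != x and z not in ys:
                candidates.setdefault(z, set()).add(y)
    return candidates

# Hypothetical documents, each reduced to the set of concepts it mentions.
documents = [
    {"X", "Y1"},
    {"X", "Y1"},
    {"Y1", "Z1"},
    {"Y1", "Z1"},
    {"X", "Y2"},
]

candidates = candidate_zs("X", documents, min_count=2)
```

With a threshold of two shared documents, only Y1 qualifies as an intermediate concept, and Z1 is proposed as a candidate linked to X through Y1.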
LBD: a Real Example
  What they found
    Treat(eicosapentaenoic acid, Raynaud's)
    Treat(Treatment for diabetes, Raynaud's)
Conclusion
  Information extraction extracts structured information from unstructured text.
  IE methods have evolved from simpler methods to higher-level NLP techniques.
  Challenges provide gold-standard datasets for evaluation.
  IE systems can be used for literature-based discovery.
References
  John McNaught, William J. Black, Information Extraction, Text Mining for Biology and Biomedicine, 2006.
  Martin, E. P., et al., Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Articles, Knowledge Exploration in Life Science Informatics, 2004.
  Kim, J., J. Park, BioIE: Retargetable information extraction and ontological annotation of biological interactions from the literature, Journal of Bioinformatics and Computational Biology, vol. 2, no. 3, 551-568, 2004.
  Katrin Fundel, Robert Kuffner, Ralf Zimmer, RelEx: Relation extraction using dependency parse trees, Bioinformatics, vol. 23, no. 3, 2007.
  Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, Kevin B. Cohen, Frontiers of biomedical text mining: current progress, Briefings in Bioinformatics, vol. 8, no. 5, 358-375, 2007.
  Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, Borut Peterlin, Exploiting Semantic Relations for Literature-Based Discovery, AMIA, 2006.
Thank you
Raynaud's Disease
  Raynaud's disease (RAY-noz) is a vascular disorder that affects blood flow to the extremities (the fingers, toes, nose and ears) when exposed to cold temperatures or in response to psychological stress. It is named for Maurice Raynaud (1834-1881), a French physician who first described it in 1862.
Huntington Disease
  An autosomal dominant inherited neurodegenerative disorder that is characterized by the insidious progressive development of mood disturbances, behavioral changes, involuntary choreiform movements and cognitive impairments. Onset is most commonly in adulthood, with a typical duration of 15-20 years before premature death.