Extracting Information from Text


Extracting Information from Text
Research Seminar: Statistical Natural Language Processing
Angela Bohn, Mathias Frey, November 25, 2010

Main goals
- Extract structured data from unstructured text
- Identify entities and relationships described in the text (i.e. named entity recognition and relation extraction)
- Training & evaluation
Introduction 2 / 25

Structured vs. Unstructured Data
Which organizations are located in Atlanta? Querying a database would be easy:

  SELECT * FROM organization WHERE UPPER(location) LIKE '%ATLANTA%';

... whereas the real world looks like:

  "..., said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta"
Introduction 3 / 25

Pipeline: raw text -> (sentence segmentation) -> sentences -> (tokenization) -> tokenized sentences -> (part-of-speech tagging) -> POS-tagged sentences -> (entity recognition) -> chunked sentences -> (relation recognition) -> relations (list of tuples).
Figure: Simple pipeline. BKL, 2009, NLP with Python, p. 263.
Introduction 4 / 25

Chaining NLTK's functions together

  >>> def ie_preprocess(document):
  ...     s = nltk.sent_tokenize(document)
  ...     s = [nltk.word_tokenize(sent) for sent in s]
  ...     s = [nltk.pos_tag(sent) for sent in s]
  ...     return s
  >>> document = """except for Australian Prime Minister
  ... Julia Gillard, whose red and white dirndl dress
  ... seemed more reminiscent of the Austrian Alps
  ... than the outback."""
  >>> ie_preprocess(document)
  [[..., ('except', 'IN'), ('for', 'IN'), ('Australian', 'JJ'),
  ('Prime', 'NNP'), ('Minister', 'NNP'), ('Julia', 'NNP'), ...]]
Introduction 5 / 25

Usage of R and openNLP
Import the libraries and try sentence detection:

  > library(openNLP)
  > library(openNLPmodels.en)
  > s <- "The little yellow dog barked at the cat."
  > s <- sentDetect(s, language = "en")
  > s
  [1] "The little yellow dog barked at the cat."
Introduction 6 / 25

POS tagging
POS tagging with tagPOS(). Mind the dependence on the input language.

  > t <- tagPOS(s, language = "en")
  > t
  [1] "The/DT little/JJ yellow/JJ dog/NN barked/VBN at/IN the/DT cat./NN"
Introduction 7 / 25

Chunking
Segment and label multi-token sequences without overlaps, a.k.a. partial parsing. The pipeline is a prerequisite of chunking:
- sentence segmentation
- tokenization
- POS tagging
Chunking 8 / 25

Chunking Techniques
NLTK's chunkers depend on:
- regular expressions
- unigrams, bigrams, n-grams
- a classifier + feature extractor
Chunking 9 / 25
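The unigram idea can be sketched without NLTK: learn, from tagged-and-chunked training data, the most frequent IOB label for each POS tag, then apply that mapping to new sentences. The following is a minimal pure-Python sketch with invented toy training data and function names, not NLTK's implementation (which builds on its unigram tagger and a real corpus such as conll2000):

```python
from collections import Counter, defaultdict

def train_unigram_chunker(tagged_iob_sents):
    """Learn the most frequent IOB label for each POS tag."""
    counts = defaultdict(Counter)
    for sent in tagged_iob_sents:
        for word, pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def chunk(model, tagged_sent):
    """Label each (word, pos) pair; unseen tags fall back to 'O'."""
    return [(w, p, model.get(p, "O")) for w, p in tagged_sent]

# Toy training data in (word, POS, IOB) form -- illustrative, not a real corpus
train = [[("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
         [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")]]

model = train_unigram_chunker(train)
print(chunk(model, [("the", "DT"), ("little", "JJ"), ("cat", "NN")]))
# -> [('the', 'DT', 'B-NP'), ('little', 'JJ', 'O'), ('cat', 'NN', 'I-NP')]
```

Note how the unseen tag JJ falls back to O; a bigram or classifier-based chunker would use more context to avoid such mistakes.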

Chunk Grammar: A regex approach

  >>> sentence = [('the', 'DT'), ('little', 'JJ'),
  ...     ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'),
  ...     ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
  >>> grammar = "NP: {<DT>?<JJ>*<NN>}"
  >>> cp = nltk.RegexpParser(grammar)
  >>> result = cp.parse(sentence)
  >>> result.draw()

The resulting tree:

  (S (NP the/DT little/JJ yellow/JJ dog/NN)
     barked/VBD at/IN
     (NP the/DT cat/NN))
Chunking 10 / 25

Figure: Tag pattern manipulation with NLTK's chunkparser application. Chunking 11 / 25

Exploring text corpora

  >>> brown = nltk.corpus.brown
  >>> def find_cnk(grammar):
  ...     cp = nltk.RegexpParser(grammar)
  ...     for sent in brown.tagged_sents():
  ...         tree = cp.parse(sent)
  ...         for subtree in tree.subtrees():
  ...             if subtree.node == 'CHUNK': yield subtree
  >>> for t in find_cnk('CHUNK: {<VBN> <TO> <V.*>}'):
  ...     print t
  (CHUNK delighted/VBN to/TO meet/VB)
  (CHUNK come/VBN to/TO talk/VB)
  (CHUNK used/VBN to/TO express/VB)
  (CHUNK given/VBN to/TO understand/VB)
  ...
Chunking 12 / 25

ChunkeR
A chunker in R. For comparison, the NLTK version:

  >>> sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'),
  ...     ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
  >>> grammar = "NP: {<DT>?<JJ>*<NN>}"
  >>> cp = nltk.RegexpParser(grammar)
  >>> result = cp.parse(sentence)
  >>> print result
  (S
    (NP the/DT little/JJ yellow/JJ dog/NN)
    barked/VBD
    at/IN
    (NP the/DT cat/NN))

The same in R, as a regular-expression substitution over tagPOS() output:

  > t
  [1] "The/DT little/JJ yellow/JJ dog/NN barked/VBN at/IN the/DT cat./NN"
  > npchunker <- function(input) {
  +     p <- "(\\<[^[:space:]]+/DT\\> )?((\\<[^[:space:]]+/JJ.?\\> )*)(\\<[^[:space:]]+/NN\\>)"
  +     r <- "(NP \\1\\2\\4)"
  +     output <- gsub(input, pattern = p, replacement = r)
  +     output <- paste("(S ", output, ")", sep = "")
  +     output
  + }
  > npchunker(t)
  [1] "(S (NP The/DT little/JJ yellow/JJ dog/NN) barked/VBN at/IN (NP the/DT cat./NN))"
Chunking 13 / 25

ChunkeR (2)
Another chunker in R. The NLTK version:

  grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
        {<NNP>+}                # chunk sequences of proper nouns
  """
  cp = nltk.RegexpParser(grammar)
  sentence = [('Rapunzel', 'NNP'), ('let', 'VBD'), ('down', 'RP'),
              ('her', 'PP$'), ('long', 'JJ'), ('golden', 'JJ'), ('hair', 'NN')]
  >>> print cp.parse(sentence)
  (S (NP Rapunzel/NNP)
     let/VBD down/RP
     (NP her/PP$ long/JJ golden/JJ hair/NN))

The same in R:

  > rapunzel <- tagPOS("Rapunzel let down her long golden hair.")
  > another.npchunker <- function(input) {
  +     rule1 <- "(\\<[^[:space:]]+/(PRP\\$|DT) )((\\<[^[:space:]]+/JJ\\> )*)(\\<[^[:space:]]+/NN\\>)"
  +     rule2 <- "(\\<[^[:space:]]+/NNP\\>)+"
  +     output <- gsub(input, pattern = rule1, replacement = "(NP \\1\\3\\5)")
  +     output <- gsub(output, pattern = rule2, replacement = "(NP \\1)")
  +     output <- paste("(S ", output, ")", sep = "")
  +     output
  + }
  > another.npchunker(rapunzel)
  [1] "(S (NP Rapunzel/NNP) let/VB down/RP (NP her/PRP$ long/JJ golden/JJ hair./NN))"
Chunking 14 / 25

Chinking
Chinks are patterns excluded from chunks.

  grammar = r"""
  ... NP:
  ...     {<.*>+}        # chunk everything
  ...     }<VBD|IN>+{    # chink VBD and IN
  ... """
  >>> cp = nltk.RegexpParser(grammar)
  >>> print cp.parse(sentence)
  (S
    (NP the/DT little/JJ yellow/JJ dog/NN)
    barked/VBD
    at/IN
    (NP the/DT cat/NN))
Chunking 15 / 25

ChinkeR
A chinker in R:

  > chinker <- function(input) {
  +     p <- "(\\<.*\\>)+ (\\<[^[:space:]]+/VBN\\>) (\\<[^[:space:]]+/IN\\>) (\\<.*\\>)+"
  +     r <- "(NP \\1) \\2 \\3 (NP \\4)"
  +     output <- gsub(input, pattern = p, replacement = r)
  +     output <- paste("(S", output, ")", sep = "")
  +     output
  + }
  > chinker(t)
  [1] "(S(NP The/DT little/JJ yellow/JJ dog/NN) barked/VBN at/IN (NP the/DT cat./NN))"
Chunking 16 / 25

IOB tags
IOB tags are a standard way to represent chunk structures in files: B marks a token as the beginning of a chunk, I marks a token inside a chunk, and O marks a token outside any chunk.

  We      PRP  B-NP
  saw     VBD  O
  the     DT   B-NP
  little  JJ   I-NP
  yellow  JJ   I-NP
  dog     NN   I-NP
Chunking 17 / 25
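The conversion between tree-style chunks and IOB lines is mechanical; the following pure-Python sketch illustrates it (the to_iob name and the list-based chunk representation are invented for this example):

```python
def to_iob(chunked_sent):
    """Flatten a chunked sentence into (word, pos, IOB) triples.
    Chunks are ('NP', [(word, pos), ...]) pairs; other items are bare (word, pos)."""
    rows = []
    for item in chunked_sent:
        if isinstance(item[1], list):   # a chunk: B- on the first token, I- on the rest
            label, tokens = item
            for i, (w, p) in enumerate(tokens):
                rows.append((w, p, ("B-" if i == 0 else "I-") + label))
        else:                           # a token outside any chunk
            w, p = item
            rows.append((w, p, "O"))
    return rows

sent = [("NP", [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")]),
        ("barked", "VBD"), ("at", "IN"),
        ("NP", [("the", "DT"), ("cat", "NN")])]
for w, p, iob in to_iob(sent):
    print(w, p, iob)
```

Running it reproduces the column layout shown above for the dog/cat sentence.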

IOB Tags > write.iob <- function(input, file) { + output <- npchunker(tagpos(input)) + output <- gsub(output, pattern = "^\\(S (.+)\\)$", replacement = "\\1") + output <- gsub(output, pattern = "(\\(NP [^\\)]+\\))", replacement = " \\1 ") + output <- gsub(output, pattern = "^\\ \\ $", replacement = "") + output <- unlist(strsplit(output, split = "[[:space:]]?\\ [[:space:]]?")) + output <- strsplit(output, split = " ") + annotate <- function(x) { + if (length(grep(x, pattern = "\\(NP")) == 0) { + y <- paste(gsub(x, pattern = "/", replacement = " "), "O") + } + if (length(grep(x, pattern = "\\(NP")) > 0) { + y <- paste(gsub(x, pattern = "/", replacement = " "), "I-NP") + y[2] <- gsub(y[2], pattern = "I-NP", replacement = "B-NP") + y <- y[2:length(y)] + } + y <- gsub(y, pattern = "( [[:upper:]]{2,3})(\\))( [[:upper:]])", replacement = "\\1\\3") + y + } + output <- lapply(output, annotate) + unlink(file) + cat(unlist(output), file = file, append = TRUE, sep = "\n") + } > write.iob(s, file = "output.txt") Chunking 18 / 25

write.iob()'s output.txt

  The     DT   B-NP
  little  JJ   I-NP
  yellow  JJ   I-NP
  dog     NN   I-NP
  barked  VBN  O
  at      IN   O
  the     DT   B-NP
  cat.    NN   I-NP
Chunking 19 / 25

Evaluation against training corpus
Establishing a baseline without a grammar. (Notice that 43.4% of our evaluation corpus tokens are outside of chunks.)

  >>> from nltk.corpus import conll2000 as ev
  >>> cp = nltk.RegexpParser("")
  >>> test_sents = ev.chunked_sents(
  ...     'test.txt', chunk_types=['NP'])
  >>> print cp.evaluate(test_sents)
  ChunkParse score:
      IOB Accuracy:  43.4%
      Precision:      0.0%
      Recall:         0.0%
      F-Measure:      0.0%
Evaluation 20 / 25

Evaluation against training corpus (2)

  >>> grammar = r"NP: {<[CDJNP].*>+}"
  >>> cp = nltk.RegexpParser(grammar)
  >>> print cp.evaluate(test_sents)
  ChunkParse score:
      IOB Accuracy:  87.7%
      Precision:     70.6%
      Recall:        67.8%
      F-Measure:     69.2%

Precision is the share of proposed NP chunks that are correct; recall is the share of gold-standard NP chunks that were found; the F-measure is their harmonic mean.
Evaluation 21 / 25
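These scores follow directly from comparing proposed chunk spans with gold-standard spans; a short sketch (the (start, end, label) span representation and the prf name are assumptions made for this illustration, not NLTK's evaluate() internals):

```python
def prf(gold_chunks, guessed_chunks):
    """Precision, recall and F-measure over two sets of chunk spans."""
    gold, guess = set(gold_chunks), set(guessed_chunks)
    correct = len(gold & guess)                    # spans that match exactly
    p = correct / len(guess) if guess else 0.0     # correct / proposed
    r = correct / len(gold) if gold else 0.0       # correct / gold-standard
    f = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean
    return p, r, f

gold = {(0, 2, "NP"), (3, 4, "NP"), (6, 8, "NP")}  # (start, end, label) spans
guess = {(0, 2, "NP"), (3, 5, "NP")}               # one exact match, one near miss
print(prf(gold, guess))
```

The near miss (3, 5) counts as wrong, which is why chunk-level precision and recall are stricter than token-level IOB accuracy.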

Named Entity Recognition (NER)
There are two approaches:
- gazetteer / dictionary lookup
- classifier
Named Entities and Relations 22 / 25
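A minimal gazetteer lookup might look like the sketch below (the dictionary entries and the greedy longest-match strategy are illustrative assumptions). It also shows why the approach is error-prone: any listed string matches regardless of context, so ambiguous words like "Reading" would always be tagged as locations:

```python
# Toy gazetteer -- invented entries for illustration
GAZETTEER = {"atlanta": "LOC", "new york": "LOC", "georgia-pacific": "ORG"}

def gazetteer_ner(tokens):
    """Greedy longest-match lookup over token spans (case-insensitive)."""
    entities, i = [], 0
    while i < len(tokens):
        # try the longest candidate span first (here at most 3 tokens)
        for j in range(min(len(tokens), i + 3), i, -1):
            span = " ".join(tokens[i:j]).lower()
            if span in GAZETTEER:
                entities.append((" ".join(tokens[i:j]), GAZETTEER[span]))
                i = j
                break
        else:
            i += 1
    return entities

print(gazetteer_ner("Georgia-Pacific has offices in Atlanta and New York".split()))
# -> [('Georgia-Pacific', 'ORG'), ('Atlanta', 'LOC'), ('New York', 'LOC')]
```

A classifier-based recognizer instead uses contextual features (surrounding words, POS tags, capitalization) to decide whether a span is an entity at all.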

Figure: Error-prone location detection with a gazetteer. BKL, 2009, NLP with Python, p. 282. Named Entities and Relations 23 / 25

Relation extraction
Relation extraction builds on the identified named entities and uses regular expressions to match the text between them.
Named Entities and Relations 24 / 25

Exploring text corpora

  >>> import re
  >>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')
  >>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
  ...     for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
  ...                                      corpus='ieer', pattern=IN):
  ...         print nltk.sem.show_raw_rtuple(rel)
  [ORG: 'DDB Needham'] 'in' [LOC: 'New York']
  [ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
  [ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
  [ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
Named Entities and Relations 25 / 25