VI.3 Rule-Based Information Extraction

Goal: Identify and extract unary, binary, or n-ary relations as facts embedded in regularly structured text, to generate entries in a schematized database.

Rule-driven regular expression matching: interpret documents from a source (e.g., a Web site to be wrapped) as a regular language, and specify/infer rules for matching specific types of facts.

Example (movie listing):
Title                               Year
The Shawshank Redemption            1994
The Godfather                       1972
The Godfather - Part II             1974
Pulp Fiction                        1994
The Good, the Bad, and the Ugly     1966
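As a minimal illustration of such rule-driven matching (a sketch only, not the wrapper formalisms discussed on the following slides), a single regular expression can pull (title, year) facts out of a line-structured listing; the listing string and the pattern here are hypothetical:

```python
import re

# Hypothetical line-structured listing: one "<title> <year>" fact per line.
listing = """The Shawshank Redemption 1994
The Godfather 1972
Pulp Fiction 1994"""

# Rule: a fact is a title followed by a four-digit year at the end of the line.
fact_rule = re.compile(r"^(?P<title>.+?)\s+(?P<year>\d{4})$")

movies = [(m.group("title"), int(m.group("year")))
          for m in (fact_rule.match(line) for line in listing.splitlines())
          if m is not None]
print(movies)  # [('The Shawshank Redemption', 1994), ('The Godfather', 1972), ('Pulp Fiction', 1994)]
```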

LR Rules

An LR rule has three parts: a pre-filler pattern (the L token, i.e., the left neighbor), a filler pattern (the fact token), and a post-filler pattern (the R token, i.e., the right neighbor).

Example:
L = <B>, R = </B>  for  Country
L = <I>, R = </I>  for  Code

applied to

<HTML>
  <TITLE>Some Country Codes</TITLE>
  <BODY>
    <B>Congo</B><I>242</I><BR>
    <B>Egypt</B><I>20</I><BR>
    <B>France</B><I>30</I><BR>
  </BODY>
</HTML>

produces a relation with the tuples <Congo, 242>, <Egypt, 20>, <France, 30>.

Rules are often very specific and are therefore combined/generalized.
Full details: RAPIER [Califf and Mooney 03]
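A minimal sketch of how such LR rules can be applied, simulated with regular expressions over the country-code page above; the dictionary-based rule representation is a simplification for illustration and not RAPIER's rule language:

```python
import re

html = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B><I>242</I><BR>"
        "<B>Egypt</B><I>20</I><BR>"
        "<B>France</B><I>30</I><BR>"
        "</BODY></HTML>")

# One LR rule per attribute: (L token, R token) delimiting the filler.
lr_rules = {"Country": ("<B>", "</B>"), "Code": ("<I>", "</I>")}

def apply_lr(rule, text):
    left, right = (re.escape(tok) for tok in rule)
    return re.findall(left + r"(.*?)" + right, text)

# Zip the per-attribute fillers into tuples of the extracted relation.
relation = list(zip(*(apply_lr(lr_rules[attr], html) for attr in ("Country", "Code"))))
print(relation)  # [('Congo', '242'), ('Egypt', '20'), ('France', '30')]
```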

Advanced Rules: HLRT, OCLR, NHLRT, etc.

Idea: Limit the application of LR rules to the proper context (e.g., to skip over an HTML table header):

<TABLE>
  <TR><TH><B>Country</B></TH><TH><I>Code</I></TH></TR>
  <TR><TD><B>Congo</B></TD><TD><I>242</I></TD></TR>
  <TR><TD><B>Egypt</B></TD><TD><I>20</I></TD></TR>
  <TR><TD><B>France</B></TD><TD><I>30</I></TD></TR>
</TABLE>

HLRT rules (head left token right tail): apply an LR rule only if it occurs inside H...T (e.g., H = <TD>, T = </TD>)
OCLR rules (open (left token right)* close): O and C identify a tuple; LR is repeated for the individual elements
NHLRT rules (nested HLRT): apply a rule at the current nesting level, open additional levels, or return to a higher level
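Under the same simplification as before, an HLRT-style rule first restricts matching to the H...T context and only then applies the LR rule, which skips the <TH> header cells of the table above (a sketch, not a full HLRT implementation):

```python
import re

table = ("<TABLE>"
         "<TR><TH><B>Country</B></TH><TH><I>Code</I></TH></TR>"
         "<TR><TD><B>Congo</B></TD><TD><I>242</I></TD></TR>"
         "<TR><TD><B>Egypt</B></TD><TD><I>20</I></TD></TR>"
         "</TABLE>")

def apply_hlrt(head, tail, left, right, text):
    # Apply the LR rule only inside head...tail contexts,
    # so the <TH> header cells are skipped entirely.
    values = []
    for context in re.findall(re.escape(head) + r"(.*?)" + re.escape(tail), text):
        values.extend(re.findall(re.escape(left) + r"(.*?)" + re.escape(right), context))
    return values

print(apply_hlrt("<TD>", "</TD>", "<B>", "</B>", table))  # ['Congo', 'Egypt']
print(apply_hlrt("<TD>", "</TD>", "<I>", "</I>", table))  # ['242', '20']
```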

Learning Regular Expressions

Input: hand-tagged examples of a regular language
Learn: a (restricted) regular expression for the language, or a finite-state transducer that reads sentences of the language and outputs the tokens of interest

Example:
This apartment has 3 bedrooms. <BR> The monthly rent is $ 995.
This apartment has 4 bedrooms. <BR> The monthly rent is $ 980.
The number of bedrooms is 2. <BR> The rent is $ 650 per month.

yields the learned pattern:  * <digit> * <BR> * $ <digit>+ *

Problem: Grammar inference for full-fledged regular languages is hard, so the focus is often on a restricted class of regular languages.
Full details: WHISK [Soderland 99]
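The learned pattern * <digit> * <BR> * $ <digit>+ * translates roughly into the regular expression below; this sketch only shows how such a learned pattern would be applied, not WHISK's learning procedure:

```python
import re

# Rough translation of the learned pattern:  * <digit> * <BR> * $ <digit>+ *
pattern = re.compile(r".*?(\d+).*?<BR>.*?\$\s*(\d+)")

ads = [
    "This apartment has 3 bedrooms. <BR> The monthly rent is $ 995.",
    "The number of bedrooms is 2. <BR> The rent is $ 650 per month.",
]
for ad in ads:
    m = pattern.match(ad)
    if m:
        print({"bedrooms": int(m.group(1)), "rent": int(m.group(2))})
# {'bedrooms': 3, 'rent': 995}
# {'bedrooms': 2, 'rent': 650}
```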

Properties and Limitations of Rule-Based IE

Powerful for wrapping regularly structured Web pages (e.g., template-based pages from the same Deep Web site), but there are many complications with real-life HTML (e.g., misuse of tables for layout).

The flat view of the input limits the possible annotations. Instead, consider the hierarchical document structure (e.g., DOM tree, XHTML) and learn extraction patterns for restricted regular languages (e.g., combinations of XPath and first-order logic).

Regularities with exceptions are difficult to capture. Instead, learn from positive and negative cases (and use statistical models).

Additional Literature for VI.3

M. E. Califf and R. J. Mooney: Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction, JMLR 4:177-210, 2003
S. Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text, Machine Learning 34(1-3):233-272, 1999

VI.4 Learning-Based Information Extraction

For heterogeneous sources and for natural-language text:
apply NLP techniques (PoS tagging, parsing) for tokenization,
identify patterns (regular expressions) as features,
train statistical learners for segmentation and labeling (e.g., HMM, CRF, SVM, etc.), augmented with lexicons, and
use the learned model to automatically tag new input sequences.

Training data (with event, location, person, and organization annotations):
The WWW conference takes place in Banff in Canada.
Today's keynote speaker is Dr. Berners-Lee from W3C.
The panel in Edinburgh is chaired by Ron Brachman from Yahoo!

IE as Boundary Classification

Idea: Learn classifiers to recognize the start token and the end token of the facts under consideration. Combine multiple classifiers (ensemble learning) for more robust output.

Example (with person, place, and time annotations):
There will be a talk by Alan Turing at the University at 4 PM.
Prof. Dr. James Watson will speak on DNA at MPI at 6 PM.
The lecture by Francis Crick will be in the IIF at 3 PM.

Classifiers test each token (with its PoS tag, LR neighbor tokens, etc. as features) for two classes: begin-fact and end-fact.

Text Segmentation and Labeling

Idea: The observed text is a concatenation of structured record fields, with limited reordering and some missing fields.

Example: addresses and bibliographic records

Address (House number | Building | Road | City | State | Zip):
4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122

Bibliographic record (Author | Year | Title | Journal | Volume | Page):
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick | (1993) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media | J. Amer. Chem. Soc. | 115 | 12231-12237

Source: [Sarawagi 08]

Hidden Markov Models (HMMs)

Assume that the observed text is generated by a regular grammar with some probabilistic variation (i.e., a stochastic FSA = Markov model).
Each state corresponds to a category (e.g., noun, phone number, person) that we seek to label in the observed text.
Each state has a known probability distribution over the words that can be output in this state.
The objective is to identify the state sequence (from a start to an end state) with maximum probability of generating the observed text.
The output (i.e., the observed text) is known, but the state sequence cannot be observed, hence the name Hidden Markov Model.

Hidden Markov Models

A Hidden Markov Model (HMM) is a discrete-time, finite-state Markov model consisting of
a state space S = {s_1, ..., s_n}, where the state in step t is denoted X(t),
initial state probabilities p_i (i = 1, ..., n),
transition probabilities p_ij : S × S → [0,1], denoted p(s_i → s_j),
an output alphabet Σ = {w_1, ..., w_m}, and
state-specific output probabilities q_ik : S × Σ → [0,1], denoted q(s_i ↑ w_k).

The probability of emitting an output sequence o_1, ..., o_T ∈ Σ^T is

$\sum_{x_1,\dots,x_T \in S} \; \prod_{i=1}^{T} p(x_{i-1} \to x_i)\, q(x_i \uparrow o_i)$   with $p(x_0 \to x_i) = p(x_i)$ (the initial state probability).
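The emission probability above can be computed directly as a sum over all state sequences. The following sketch does exactly that for a toy labeling HMM; all parameter values are made up for illustration, and the Start/End states used in the example on the next slide are omitted:

```python
from itertools import product

# Toy labeling HMM with made-up parameters (Start/End states omitted for brevity).
states = ["Name", "Street", "Number"]
init = {"Name": 0.6, "Street": 0.3, "Number": 0.1}                  # p_i
trans = {("Name", "Name"): 0.3, ("Name", "Street"): 0.7,            # p(s_i -> s_j)
         ("Street", "Street"): 0.2, ("Street", "Number"): 0.8,
         ("Number", "Number"): 1.0}
emit = {("Name", "MPI"): 0.9, ("Name", "St."): 0.05, ("Name", "85"): 0.05,      # q(s_i ↑ w_k)
        ("Street", "MPI"): 0.1, ("Street", "St."): 0.8, ("Street", "85"): 0.1,
        ("Number", "MPI"): 0.05, ("Number", "St."): 0.05, ("Number", "85"): 0.9}

def sequence_probability(observations):
    # Naive O(n^T) sum over all state sequences, mirroring the formula above;
    # missing transitions are treated as probability 0.
    total = 0.0
    for xs in product(states, repeat=len(observations)):
        p, prev = 1.0, None
        for x, o in zip(xs, observations):
            p *= init[x] if prev is None else trans.get((prev, x), 0.0)
            p *= emit.get((x, o), 0.0)
            prev = x
        total += p
    return total

print(sequence_probability(["MPI", "St.", "85"]))
```

The exponential cost of this enumeration is exactly what the forward/backward computations below avoid.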

HMM Example

Goal: Label the tokens in the sequence "Max-Planck-Institute, Stuhlsatzenhausweg 85" with the labels Name, Street, Number.

Σ = { MPI, St., 85 }          // output alphabet
S = {Name, Street, Number}    // (hidden) states
p_i = {0.6, 0.3, 0.1}         // initial state probabilities

[Figure: state-transition diagram with states Start, Name, Street, Number, End, the transition probabilities between them, and the state-specific output probabilities for MPI, St., and 85]

Three Major Issues with HMMs

Compute the probability of an output sequence for known parameters: forward/backward computation.
Compute the most likely state sequence for a given output and known parameters (decoding): Viterbi algorithm (using dynamic programming).
Estimate the parameters (transition probabilities and output probabilities) from training data (output sequences only): Baum-Welch algorithm (a specific form of Expectation Maximization).

Full details: [Rabiner 90]

Forward Computation

The probability of emitting an output sequence o_1, ..., o_T ∈ Σ^T is

$\sum_{x_1,\dots,x_T \in S} \; \prod_{i=1}^{T} p(x_{i-1} \to x_i)\, q(x_i \uparrow o_i)$   with $p(x_0 \to x_i) = p(x_i)$.

Naive computation would require O(n^T) operations! Iterative forward computation with clever caching and reuse of intermediate results ("memoization") requires only O(n^2 T) operations.

Let $\alpha_i(t) = P[o_1, \dots, o_{t-1}, X(t) = i]$ denote the probability of being in state i at time t and having already emitted the prefix output o_1, ..., o_{t-1}.

Begin:      $\alpha_i(1) = p_i$
Induction:  $\alpha_j(t+1) = \sum_{i=1}^{n} \alpha_i(t)\, p(s_i \to s_j)\, q(s_i \uparrow o_t)$
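A compact sketch of the forward computation for a dictionary-based HMM representation (all numbers are made up for illustration); summing the final α values yields the sequence probability, assuming each state's outgoing transition probabilities sum to 1:

```python
def forward(observations, states, init, trans, emit):
    # alpha_i(t) = P[o_1, ..., o_{t-1}, X(t) = i], computed iteratively in O(n^2 T).
    alpha = dict(init)                                    # alpha_i(1) = p_i
    for o in observations:
        alpha = {j: sum(alpha[i] * trans.get((i, j), 0.0) * emit.get((i, o), 0.0)
                        for i in states)
                 for j in states}
    # Summing over the states at step T+1 gives P[o_1, ..., o_T]
    # (this assumes each state's outgoing transition probabilities sum to 1).
    return sum(alpha.values())

# Tiny two-state example with made-up numbers.
states = ["A", "B"]
init = {"A": 0.5, "B": 0.5}
trans = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.2, ("B", "B"): 0.8}
emit = {("A", "x"): 0.7, ("A", "y"): 0.3, ("B", "x"): 0.1, ("B", "y"): 0.9}
print(forward(["x", "y", "y"], states, init, trans, emit))
```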

Backward Computation

The probability of emitting an output sequence o_1, ..., o_T ∈ Σ^T is

$\sum_{x_1,\dots,x_T \in S} \; \prod_{i=1}^{T} p(x_{i-1} \to x_i)\, q(x_i \uparrow o_i)$   with $p(x_0 \to x_i) = p(x_i)$.

Naive computation would require O(n^T) operations! Iterative backward computation with clever caching and reuse of intermediate results ("memoization") avoids this.

Let $\beta_i(t) = P[o_{t+1}, \dots, o_T \mid X(t) = i]$ denote the probability of emitting the suffix output o_{t+1}, ..., o_T when being in state i at time t.

Begin:      $\beta_i(T) = 1$
Induction:  $\beta_j(t-1) = \sum_{i=1}^{n} \beta_i(t)\, p(s_j \to s_i)\, q(s_i \uparrow o_t)$
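The backward recursion can be sketched analogously, using the same dictionary representation as the forward sketch above; combining β(1) with the initial and output probabilities recovers the same sequence probability:

```python
def backward(observations, states, init, trans, emit):
    # beta_i(t) = P[o_{t+1}, ..., o_T | X(t) = i]; same dictionary format as the forward sketch.
    beta = {i: 1.0 for i in states}                       # beta_i(T) = 1
    for o in reversed(observations[1:]):                  # induction from t = T down to t = 2
        beta = {j: sum(beta[i] * trans.get((j, i), 0.0) * emit.get((i, o), 0.0)
                       for i in states)
                for j in states}
    # Combining beta(1) with the initial and output probabilities yields P[o_1, ..., o_T].
    return sum(init[i] * emit.get((i, observations[0]), 0.0) * beta[i] for i in states)
```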

Trellis Diagram for the HMM Example

[Figure: trellis over the time steps t = 1, 2, 3 (observations MPI, St., 85) with the states Name, Street, Number between Start and End, shown next to the state-transition diagram of the example]

Forward probabilities:
α_Name(1) = 0.6    α_Street(1) = 0.3    α_Number(1) = 0.1
α_Name(2) = 0.6 * 0.7 * 0.2 + 0.3 * 0.2 * 0.2 + 0.1 * 0.0 * 0.1 = 0.096

Larger HMM for Bibliographic Records

[Figure: larger HMM whose states correspond to the fields of bibliographic records]
Source: [Chakrabarti 09]

Viterbi Algorithm

Goal: Identify the state sequence x_1, ..., x_T that is most likely to have generated the observed output o_1, ..., o_T.

Viterbi algorithm (dynamic programming):

$\delta_i(1) = p_i$    // highest probability of being in state i at step 1
$\psi_i(1) = 0$        // highest-probability predecessor of state i

for t = 1, ..., T:
$\delta_j(t+1) = \max_{i=1,\dots,n} \delta_i(t)\, p(x_i \to x_j)\, q(x_i \uparrow o_t)$      // probability
$\psi_j(t+1) = \arg\max_{i=1,\dots,n} \delta_i(t)\, p(x_i \to x_j)\, q(x_i \uparrow o_t)$    // state

The most likely state sequence is obtained by backtracking through the memoized values δ_i(t) and ψ_i(t).
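A sketch of Viterbi decoding for the same dictionary-based representation; the indexing follows the common textbook convention in which the state occupied at time t emits o_t, which differs slightly from the slide's recursion but implements the same dynamic program with memoized predecessors and backtracking:

```python
def viterbi(observations, states, init, trans, emit):
    # delta[j] = highest probability of any state sequence that ends in state j and
    # accounts for the observations consumed so far; back[t][j] = best predecessor of j.
    delta = {j: init[j] * emit.get((j, observations[0]), 0.0) for j in states}
    back = []
    for o in observations[1:]:
        step, new_delta = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: delta[i] * trans.get((i, j), 0.0))
            step[j] = best_i
            new_delta[j] = delta[best_i] * trans.get((best_i, j), 0.0) * emit.get((j, o), 0.0)
        back.append(step)
        delta = new_delta
    # Backtrack from the best final state through the memoized predecessors.
    last = max(states, key=lambda j: delta[j])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path)), delta[last]
```

Called with the toy dictionaries from the forward sketch, it returns the most likely label sequence together with its probability.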

Training of HMMs

Simple case: If fully tagged training sequences are available, we can use MLE to estimate the parameters:

$p(s_i \to s_j) = \dfrac{\#\,\text{transitions } s_i \to s_j}{\sum_{s_k} \#\,\text{transitions } s_i \to s_k}$

$q(s_i \uparrow w_k) = \dfrac{\#\,\text{outputs } s_i \uparrow w_k}{\sum_{w_l} \#\,\text{outputs } s_i \uparrow w_l}$

Standard case: Training with unlabeled sequences (i.e., observed output only, the state sequence is unknown): Baum-Welch algorithm (a variant of Expectation Maximization).

Note: There is some work on learning the structure of HMMs (number of states, connections, etc.), but this remains very difficult and computationally expensive.
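For the simple, fully tagged case, the MLE estimates above amount to counting and normalizing transitions and outputs per state; a minimal sketch (the training sequence is hypothetical):

```python
from collections import Counter

def mle_estimate(tagged_sequences):
    # Each training item is a list of (state, word) pairs; returns MLE dictionaries
    # for the transition and output probabilities by counting and normalizing.
    trans_counts, emit_counts = Counter(), Counter()
    for seq in tagged_sequences:
        for (state, word), successor in zip(seq, list(seq[1:]) + [None]):
            emit_counts[(state, word)] += 1
            if successor is not None:
                trans_counts[(state, successor[0])] += 1
    trans_totals, emit_totals = Counter(), Counter()
    for (state, _), c in trans_counts.items():
        trans_totals[state] += c
    for (state, _), c in emit_counts.items():
        emit_totals[state] += c
    trans = {k: c / trans_totals[k[0]] for k, c in trans_counts.items()}
    emit = {k: c / emit_totals[k[0]] for k, c in emit_counts.items()}
    return trans, emit

# Hypothetical fully tagged training sequence.
training = [[("Name", "MPI"), ("Street", "St."), ("Number", "85")]]
print(mle_estimate(training))
```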

Problems and Extensions of HMMs

Individual output letters/words may not show learnable patterns; output words can instead be entire lexical classes (e.g., numbers, zip codes, etc.).
HMMs are geared for flat sequences, not for structured text documents; a remedy is nested HMMs, where each state can hold another HMM.
HMMs cannot capture long-range dependencies (Markov property); e.g., in addresses, when the first word is Mr. or Mrs., the probability of seeing a P.O. box later decreases substantially.
HMMs cannot easily incorporate multiple complex word features (e.g., isyear(w), isdigit(w), allcaps(w), etc.).

Conditional Random Fields (CRFs) address these limitations.

Additional Literature for VI.4

S. Chakrabarti: Extracting, Searching, and Mining Annotations on the Web, Tutorial, WWW 2009 (http://www2009.org/pdf/t10-f%20extracting.pdf)
L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Readings in Speech Recognition, 1990

VI.5 Named Entity Reconciliation

Problem 1: The same entity appears in
different spellings (incl. misspellings, abbreviations, etc.), e.g., brittnee spears vs. britney spears, or
different levels of completeness, e.g., joe hellerstein vs. prof. joseph m. hellerstein.

Problem 2: Different entities happen to look the same, e.g., george w. bush vs. george h. w. bush.

These problems even occur in structured databases and require data cleaning when integrating multiple databases. Integrating heterogeneous databases or Deep Web sources also requires schema matching (a.k.a. data integration).

Entity Reconciliation Techniques

Edit distance measures (on both strings and records)
Exploit context information for higher-confidence matchings (e.g., publications/co-authors of Dave DeWitt and D. J. DeWitt)
Exploit reference dictionaries as ground truth (e.g., for address cleaning)
Propagate matching confidence values in a link-/reference-based graph structure
Statistical learning in (probabilistic) graphical models (also: joint disambiguation of multiple mentions onto the most compact/consistent set of entities)
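As a small illustration of the first technique, a generic string-similarity measure (here difflib's ratio, standing in for an edit-distance measure) already handles spelling variants, but also shows its limits:

```python
import difflib

# Mention pairs from the slides; SequenceMatcher's ratio stands in here for an
# edit-distance-based similarity (any string or record similarity could be plugged in).
candidates = [
    ("brittnee spears", "britney spears"),
    ("joe hellerstein", "prof. joseph m. hellerstein"),
    ("george w. bush", "george h. w. bush"),
]
for a, b in candidates:
    similarity = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"{a!r} vs. {b!r}: similarity {similarity:.2f}")
```

The last pair is exactly Problem 2 from the previous slide: surface similarity alone cannot tell george w. bush and george h. w. bush apart, which is why context, reference dictionaries, and joint disambiguation are needed.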

Entity Reconciliation by Matching Functions

Fellegi-Sunter model as a framework for entity reconciliation.

Input: Two sets A and B of strings or records, each with features (e.g., n-grams, attributes, etc.)

Method:
Define a family $\gamma_i : A \times B \to \{0,1\}$ (i = 1, ..., k) of attribute comparisons or similarity tests (a.k.a. matching functions).
Identify matching pairs $M \subseteq A \times B$ and non-matching pairs $U \subseteq A \times B$ as training data and compute
$m_i = P[\gamma_i(a,b) = 1 \mid (a,b) \in M]$ and $u_i = P[\gamma_i(a,b) = 1 \mid (a,b) \in U]$.
For pairs $(a,b) \in A \times B \setminus (M \cup U)$, consider a and b equivalent if
$\sum_{i=1}^{k} \frac{m_i}{u_i}\, \gamma_i(a,b) \geq \tau$ for a user-defined threshold τ.

Full details: [Fellegi and Sunter 69]
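A minimal sketch of this decision rule, with made-up similarity tests and made-up m_i/u_i values (in practice these are estimated from the labeled pairs in M and U):

```python
def fellegi_sunter_match(a, b, tests, m, u, tau):
    # Declare a and b equivalent if the weighted sum of fired tests reaches tau.
    # tests: matching functions gamma_i(a, b) -> 0/1; m, u: per-test rates on M and U.
    score = sum((m_i / u_i) * gamma(a, b) for gamma, m_i, u_i in zip(tests, m, u))
    return score >= tau

# Hypothetical similarity tests on simple name strings.
tests = [
    lambda a, b: int(a.split()[-1].lower() == b.split()[-1].lower()),      # same last token
    lambda a, b: int(a.split()[0][0].lower() == b.split()[0][0].lower()),  # same first initial
]
m = [0.95, 0.90]   # made-up P[gamma_i = 1 | (a, b) in M]
u = [0.05, 0.30]   # made-up P[gamma_i = 1 | (a, b) in U]

print(fellegi_sunter_match("Dave DeWitt", "D. J. DeWitt", tests, m, u, tau=10.0))  # True
```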

Additional Literature for VI.5

I. Fellegi and A. Sunter: A Theory for Record Linkage, Journal of the American Statistical Association 64(328):1183-1210, 1969