Information Extraction from Text


Information Extraction from Text
Jing Jiang, Chapter 2 from Mining Text Data (2012)
Presented by Andrew Landgraf, September 13, 2013

What is Information Extraction?
- The goal is to discover structured information in unstructured text.
- Early research focused on template filling (e.g., populating the fields of a police report).
- Example: from "AOL merged with Time Warner in 2000", extract the two organizations and the merger event.
- IBM Watson, Google, and Wolfram Alpha all use information extraction.
- Information extraction can be broken down into named entity recognition and relation extraction.

Named Entity Recognition
- The task is to identify real-world named entities and classify them into entity types (e.g., person, organization, location).
- A challenge: words that can have multiple entity types (e.g., "JFK" may name a person or an airport).
- NER is the most fundamental task in information extraction, and a building block for relation extraction, question answering, and search engine queries.
- Rule-based systems came first, but they are expensive to build and domain dependent.

Statistical Learning for Named Entity Recognition
- Sequence labeling is used in many natural language processing tasks.
- We have a sequence of observations $x_i$, $i = 1, \ldots, n$, and corresponding labels $y_i$.
- $x_i$ is a feature vector for the $i$-th word.
- $y_i$ depends not only on $x_i$ but also on neighboring observations $x_{N(i)}$ and labels $y_{N(i)}$.
- Each word in a sentence is treated as an observation, and the class labels $y_i$ identify the boundaries and types of named entities.
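For concreteness, here is a minimal sketch of what the observations and labels might look like, assuming the common BIO label encoding (the chapter does not fix a particular scheme, and the feature set below is invented for illustration):

```python
# A sentence as a sequence of observations x_i with BIO-encoded labels y_i.
# B-X marks the beginning of an entity of type X, I-X its continuation,
# and O marks tokens outside any entity.
sentence = ["AOL", "merged", "with", "Time", "Warner", "in", "2000"]
labels   = ["B-ORG", "O", "O", "B-ORG", "I-ORG", "O", "O"]

# A toy feature vector x_i for each word: the kinds of surface features
# commonly used for NER (word identity, capitalization, neighboring words).
def features(words, i):
    return {
        "word": words[i].lower(),
        "is_capitalized": words[i][0].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
    }

for i, (w, y) in enumerate(zip(sentence, labels)):
    print(w, y, features(sentence, i))
```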

Hidden Markov Models (HMMs)
- Treat $y$ as hidden states and maximize the joint probability $p(x, y)$.
- A first-order hidden Markov model makes two independence assumptions:
  $$p(y_i \mid y_{1:i-1}, x_{1:i-1}) = p(y_i \mid y_{i-1})$$
  $$p(x_i \mid x_{1:i-1}, y_{1:i}) = p(x_i \mid y_i)$$
- so the joint probability factorizes as
  $$p(x, y) = \prod_i p(y_i \mid y_{i-1})\, p(x_i \mid y_i)$$
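As a toy illustration of this factorization, here is a tiny HMM with made-up transition and emission tables (all numbers are invented, not from the chapter):

```python
# Toy first-order HMM for NER-style tagging.
# States: "ORG" (inside an organization name) and "O" (outside).
start = {"ORG": 0.3, "O": 0.7}                               # p(y_1)
trans = {"ORG": {"ORG": 0.5, "O": 0.5},                      # p(y_i | y_{i-1})
         "O":   {"ORG": 0.2, "O": 0.8}}
emit  = {"ORG": {"aol": 0.4, "merged": 0.01, "time": 0.3},   # p(x_i | y_i)
         "O":   {"aol": 0.01, "merged": 0.5, "time": 0.1}}

def joint_prob(x, y):
    """p(x, y) = p(y_1) p(x_1|y_1) * prod_{i>1} p(y_i|y_{i-1}) p(x_i|y_i)."""
    p = start[y[0]] * emit[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * emit[y[i]][x[i]]
    return p

print(joint_prob(["aol", "merged"], ["ORG", "O"]))  # 0.3*0.4*0.5*0.5 = 0.03
```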

Maximum Entropy Markov Models
- Model the conditional distribution of $y$ given $x$ directly; such discriminative models have been more successful than generative models.
  $$p(y \mid x) = \prod_i p(y_i \mid x_{N_l(i)}, y_{i-1})$$
- The maximum entropy principle implies an exponential (log-linear) form for each local conditional:
  $$p(y_i \mid x_{N_l(i)}, y_{i-1}) = \frac{\exp\left(\sum_j \lambda_j f_j(y_i, x_{N_l(i)}, y_{i-1})\right)}{\sum_{y'} \exp\left(\sum_j \lambda_j f_j(y', x_{N_l(i)}, y_{i-1})\right)}$$
- L-BFGS is a common method for training the model.
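A sketch of one such local conditional as a softmax over candidate labels; the feature functions and weights below are invented for illustration:

```python
import math

LABELS = ["B-ORG", "I-ORG", "O"]

def local_prob(y_i, x_window, y_prev, weights, feature_fns):
    """p(y_i | x_{N(i)}, y_{i-1}): normalized exponential of weighted features."""
    def score(label):
        return sum(w * f(label, x_window, y_prev)
                   for w, f in zip(weights, feature_fns))
    z = sum(math.exp(score(lab)) for lab in LABELS)  # normalize over candidate labels
    return math.exp(score(y_i)) / z

# Two toy binary feature functions f_j(y_i, x_{N(i)}, y_{i-1}).
feature_fns = [
    lambda y, x, yp: 1.0 if y == "B-ORG" and x["word"][0].isupper() else 0.0,
    lambda y, x, yp: 1.0 if y == "I-ORG" and yp in ("B-ORG", "I-ORG") else 0.0,
]
weights = [1.5, 2.0]

print(local_prob("B-ORG", {"word": "Time"}, "O", weights, feature_fns))
```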

Conditional Random Fields
- CRFs have undirected edges, so the current label can depend on both previous and future labels.
- Linear-chain CRF:
  $$p(y \mid x) \propto \exp\left(\sum_i \sum_j \lambda_j f_j(y_i, y_{i-1}, x, i)\right)$$
- CRFs are more difficult to train because the normalizing constant is a sum over all possible label sequences.
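To see why the normalizer is the bottleneck, here is a brute-force computation of $\log Z(x)$ that enumerates all $|Y|^n$ label sequences; in practice the forward algorithm computes the same quantity in $O(n|Y|^2)$ time. Everything below is a toy sketch:

```python
import math
from itertools import product

LABELS = ["B-ORG", "I-ORG", "O"]

def sequence_score(y, x, weights, feature_fns):
    """sum_i sum_j lambda_j * f_j(y_i, y_{i-1}, x, i) for one label sequence y."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<s>"
        total += sum(w * f(y[i], y_prev, x, i)
                     for w, f in zip(weights, feature_fns))
    return total

def log_partition(x, weights, feature_fns):
    """log Z(x), exponential in len(x): exactly the cost CRF training must avoid."""
    return math.log(sum(math.exp(sequence_score(y, x, weights, feature_fns))
                        for y in product(LABELS, repeat=len(x))))

feature_fns = [lambda yi, yp, x, i: 1.0 if yi == "B-ORG" and x[i][0].isupper() else 0.0]
print(log_partition(["Time", "Warner"], [1.5], feature_fns))
```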

Relation Extraction
- Detect and characterize semantic relations between two entities.
- The problem is usually limited to relations between two entities in the same sentence.
- Possible relation types include physical, personal/social, employment/affiliation, etc.
- Methods for this task include feature-based, kernel-based, and weakly supervised approaches.

Feature-based Relation Classification
- If a pair of known entities co-occurs in a sentence, the pair is a candidate for a relationship.
- An additional relation type, nil (no relation), is added.
- A training data set is required in which the relation types are hand-labeled.
- After adding independent variables (next slide) to the data set, train a multi-category classifier: multinomial logistic regression, multi-category SVM, discriminant analysis, etc.
- The detection of a relationship and its characterization can also be split into two separate classification steps.

Feature-based Relation Classification (continued)
- Independent variables (features) are created by various domain-specific feature engineering methods:
- Features of the entities themselves (e.g., if an entity is a person, a family relation type is more likely).
- Contextual features (e.g., the word "founded" appears, especially with a person entity as the subject and an organization entity as the object).
- Outside data (e.g., do the two entities co-occur in the same Wikipedia article?).
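Putting the last two slides together, a minimal sketch of the feature-based pipeline, assuming scikit-learn and an invented toy feature set (the chapter does not prescribe a toolkit):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy candidate entity pairs with hand-engineered features; "nil" means no relation.
examples = [
    ({"e1_type": "PER", "e2_type": "ORG", "ctx_founded": True},  "employment"),
    ({"e1_type": "PER", "e2_type": "PER", "ctx_married": True},  "personal"),
    ({"e1_type": "ORG", "e2_type": "LOC", "ctx_based": True},    "physical"),
    ({"e1_type": "PER", "e2_type": "ORG", "ctx_founded": False}, "nil"),
]
X_dicts, y = zip(*examples)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# Multinomial logistic regression, one of the multi-category classifiers named above.
clf = LogisticRegression()
clf.fit(X, y)

test = vec.transform([{"e1_type": "PER", "e2_type": "ORG", "ctx_founded": True}])
print(clf.predict(test))
```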

Kernel Methods for Relation Extraction
- In typical SVMs, features are mapped by $h : \mathbb{R}^p \to \mathbb{R}^d$ with $d \gg p$, and the decision function is
  $$f(x) = \beta_0 + h(x)^T \beta = \beta_0 + \sum_{i=1}^{n} \alpha_i y_i K(x, x_i)$$
  where $K(\cdot, \cdot)$ is the positive definite kernel corresponding to $h$.
- The kernel function between two sentences is large if the two sentences are similar in some way.
- One choice: kernels on shortest dependency paths (e.g., "protestors seized stations" and "troops raided churches" both have the form person, verb, facility).
- Tree-based kernels are another option.
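As a deliberately crude stand-in for the dependency-path kernels in the literature, the sketch below scores two paths by counting matching positions (zero if lengths differ, which keeps the kernel positive definite via an explicit feature map) and plugs the resulting Gram matrix into an SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Shortest dependency paths, abstracted to entity-type/word sequences.
paths = [
    ["PERSON", "seized", "FACILITY"],
    ["PERSON", "raided", "FACILITY"],
    ["PERSON", "visited", "LOCATION"],
]
relation = ["attack", "attack", "travel"]

def path_kernel(a, b):
    """Count positions where the two paths agree (0 if lengths differ)."""
    if len(a) != len(b):
        return 0.0
    return float(sum(u == v for u, v in zip(a, b)))

gram = np.array([[path_kernel(a, b) for b in paths] for a in paths])

svm = SVC(kernel="precomputed")
svm.fit(gram, relation)

# A new path is compared against every training path to form its kernel row.
test = [["PERSON", "stormed", "FACILITY"]]
k_test = np.array([[path_kernel(t, b) for b in paths] for t in test])
print(svm.predict(k_test))
```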

Weakly Supervised Relation Extraction
- It is expensive and time-consuming to label all the relations in a corpus; these methods require less training data.
- Bootstrapping (not the statistical resampling kind), sketched in code below:
  1. Learn relations from a small amount of labeled data.
  2. Predict new relations on unlabeled text.
  3. Treat the predicted relations you are sure about as new training data, and repeat.
- Distant supervision: supplement the data with labeled examples drawn from other corpora (e.g., Wikipedia, Freebase).
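A minimal sketch of the bootstrapping loop as self-training; the base classifier, the confidence threshold, and the stopping rule are all assumptions, since the chapter describes the idea rather than an implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap(X_labeled, y_labeled, X_unlabeled, threshold=0.9, rounds=5):
    """Iteratively move high-confidence predictions into the training set."""
    X_lab, y_lab = X_labeled.copy(), y_labeled.copy()
    X_unl = X_unlabeled.copy()
    clf = LogisticRegression()
    for _ in range(rounds):
        if len(X_unl) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold   # predictions we are "sure about"
        if not confident.any():
            break
        new_y = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[confident]])  # add as new training data
        y_lab = np.concatenate([y_lab, new_y])
        X_unl = X_unl[~confident]
    return clf.fit(X_lab, y_lab)
```

Each round the classifier retrains on its own most confident predictions; if the threshold is set too low, early mistakes snowball, which is the classic failure mode of bootstrapping.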

Evaluation
- Manually annotated test documents from the same domain have to be used.
- Precision: the percentage of predicted positives that are correct.
- Recall: the percentage of all true positives that were identified as such.
- F1: the harmonic mean of precision and recall.
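In symbols: $P = \frac{TP}{TP+FP}$, $R = \frac{TP}{TP+FN}$, $F_1 = \frac{2PR}{P+R}$. A quick sanity check with illustrative counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# e.g., 8 correctly extracted relations, 2 spurious, 4 missed
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```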