Conditional Random Fields

Conditional Random Fields
Micha Elsner
February 14, 2013

2 Sums of logs
Issue: computing the α forward probabilities can underflow.
Normally we'd fix this using logs.
But α requires a sum of probabilities, which is not easy to do in log-space.
Solution 1: scale/exponentiate:

    from math import log, exp

    def logplus(xx, yy):
        M = max(xx, yy)
        return M + log(exp(xx - M) + exp(yy - M))

Because:
Subtracting M in log-space is equivalent to dividing by the larger value.
So we have x/max(x, y) + y/max(x, y) (one of these is 1), which doesn't underflow.
This computes (1/max(x, y)) (x + y).
Then take the log again and multiply back by the max (add M in log-space) to get log(x + y).
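As a quick sanity check, here is a minimal sketch of folding logplus over a list of log-probabilities whose exponentials would underflow; the numbers are made up for illustration.

    from math import log, exp

    def logplus(xx, yy):
        # logplus as defined above: log(exp(xx) + exp(yy)), computed stably
        M = max(xx, yy)
        return M + log(exp(xx - M) + exp(yy - M))

    # exp() of any of these is 0.0 in floating point, but the log of their sum
    # can still be accumulated pairwise with logplus.
    log_probs = [-1000.0, -1001.0, -1002.5]
    total = log_probs[0]
    for lp in log_probs[1:]:
        total = logplus(total, lp)
    print(total)   # about -999.63, the log of the true sum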

3 Solution 2
Normalize the α values at each timestep.
At a single timestep, we'd get:

    A(i) = Σ_t α_t(i) = P(W_{1:i})

Divide all the α_t(i) by A(i):

    α'_t(i) = α_t(i) / A(i) = P(T_i = t | W_{1:i})

Now the α' are not so small (since they have to sum to 1).
Applying this recursively (each step starts from the previous step's normalized α'), the normalizer at step i becomes:

    A'(i) = P(W_i | W_{1:i-1})

We can extract the probability of all the words:

    log P(W_{1:n}) = log Π_i A'(i) = Σ_i log A'(i)
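A minimal NumPy sketch of this rescaled forward pass, assuming made-up parameter tables (init[t] = P(T_1 = t), trans[s, t] = P(T_{i+1} = t | T_i = s), emit[t, w] = P(W_i = w | T_i = t)) and integer word ids; none of these names come from the slides.

    import numpy as np

    def forward_logprob(init, trans, emit, words):
        alpha = init * emit[:, words[0]]          # unnormalized alpha_t(1)
        A = alpha.sum()                           # A(1) = P(W_1)
        alpha /= A                                # alpha'_t(1) = P(T_1 = t | W_1)
        log_prob = np.log(A)
        for w in words[1:]:
            alpha = (alpha @ trans) * emit[:, w]  # recurse from normalized alphas
            A = alpha.sum()                       # A'(i) = P(W_i | W_{1:i-1})
            alpha /= A
            log_prob += np.log(A)                 # sum of log normalizers = log P(W_{1:n})
        return log_prob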

4 Conditional random fields
A model which handles sequences, like the HMM, and can use many correlated features, like the max-entropy classifier, by combining the two ideas.
Text here: Sutton and McCallum, An introduction to conditional random fields.
Has everything you want to know; relatively accessible; really, really long.
Basics: sections 2.2-2.4. Applications: 2.5-2.7. Algorithms: 4.1, 4.3, 5.1.

5 Review: generative vs conditional
Generative model:
A model of the observed data (features F, tag T).
Params maximize P(F, T; θ), the joint probability of all observations.
(Graph: T with arrows pointing to f1, f2, f3.)
Conditional model:
A model of the tag as a function of the data.
Params maximize P(T | F; θ), the probability of the tag.
(Graph: T connected to f1, f2, f3.)

6 Deriving the conditional model
Move away from having a probability over feature values; instead have weights for feature values.
We want to optimize using the gradient (calculus), because the weights are correlated, so estimation isn't obvious.
It's easier to take derivatives if everything is a sum.
So: use exp(log(...)) to transform products into sums.
We end up with a dot product.

7 Making an HMM into a dot product

    P(T_{1:n} | W_{1:n}) = [ Π_{i=0}^{n} P_trans(T_{i+1} | T_i) P_emit(W_i | T_i) ] / P(W_{1:n})
                         = exp log [ Π_{i=0}^{n} P_trans(T_{i+1} | T_i) P_emit(W_i | T_i) ] / P(W_{1:n})
                         = exp ( Σ_{i=0}^{n} [ log P_trans(T_{i+1} | T_i) + log P_emit(W_i | T_i) ] ) / P(W_{1:n})

Let f_{t'→t}(T) = #(i : T_i = t', T_{i+1} = t) and θ_{t'→t} = log P(t | t').
Let f_{t→w}(T) = #(i : T_i = t, W_i = w) and θ_{t→w} = log P(w | t).
This defines a feature vector F(T) (specific to the tags T), so the numerator becomes a dot product:

    P(T_{1:n} | W_{1:n}) = exp(θ · F(T)) / P(W_{1:n})
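To make the rewrite concrete, here is a small sketch (all names and dictionary layouts are illustrative, not from the slides) that scores a tag sequence as θ · F(T) built from HMM probabilities; under these assumptions it reproduces the HMM's log joint score, i.e. the log of the numerator above.

    import math
    from collections import Counter

    def hmm_score_as_dot_product(tags, words, trans_prob, emit_prob):
        # theta: log-probabilities keyed by transition / emission feature
        theta = {}
        for (prev, cur), p in trans_prob.items():
            theta[("trans", prev, cur)] = math.log(p)    # theta_{t'->t} = log P(t | t')
        for (tag, word), p in emit_prob.items():
            theta[("emit", tag, word)] = math.log(p)     # theta_{t->w} = log P(w | t)

        # F(T): how often each transition / emission feature fires in this sequence
        F = Counter()
        for i in range(1, len(tags)):
            F[("trans", tags[i - 1], tags[i])] += 1
        for tag, word in zip(tags, words):
            F[("emit", tag, word)] += 1

        # theta . F(T); unseen pairs count as probability zero (log = -inf)
        return sum(theta.get(f, float("-inf")) * count for f, count in F.items())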

8 Making it conditional

    P(T | W) = exp(θ · F(T)) / P(W_{1:n}) = exp(θ · F(T)) / Σ_{T_{1:n}} exp(θ · F(T))

To make this conditional:
Drop the constraint that the θ are log probabilities.
Choose θ to maximize the probability of the training data, P(T | W).
The normalizer is no longer interpretable as P(W_{1:n}):

    Z(Θ) or Z = Σ_{T_{1:n}} exp(Θ · F(T))

This is the normalizing constant or partition function.
(The notation Z and the name "partition function" come from statistical physics.)
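For intuition only, here is a brute-force sketch of Z: enumerate every tag sequence and sum exp(Θ · F(T)). This is only feasible on toy inputs (real implementations use forward-style dynamic programming), and it assumes a score(tags, words) function like the one sketched above; both function names are my own.

    from itertools import product
    import math

    def partition(words, tagset, score):
        # Z = sum over all tag sequences of exp(theta . F(T))
        return sum(math.exp(score(tags, words))
                   for tags in product(tagset, repeat=len(words)))

    def crf_prob(tags, words, tagset, score):
        # P(T | W) = exp(theta . F(T)) / Z
        return math.exp(score(tags, words)) / partition(words, tagset, score)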

9 (Linear-chain) CRF

    P(T | W) = (1/Z) exp(Θ · F(T))

Graphical model: a chain T_1 - T_2 - T_3 - T_4, with each T_i also connected to its word W_i.
This is an undirected graphical model:
No arrows; arcs in the graph represent weights.
There is no interpretation of an arc as a conditional probability... just a number that gets multiplied!

10 Adding features
(Graph: the chain T_1 ... T_4, with each T_i now connected to several features of its word: f_1a, f_1b, ..., f_4a, f_4b.)
We can add features of the word just like in max-entropy:
Capitalization, numeric, suffix, etc.

11 Inference
What we want: max_T P(T_{1:n} | W_{1:n}) ∝ exp(Θ · F(T))
We can simply maximize the weight Θ · F(T).
This breaks down into a sum of the f features and θ weights.
We can maximize with Viterbi:
Write an initial weight of 0 under <s>.
On an arc, add the transition and emission θ:

    µ_t(i) = max_{t'} [ θ_{t→w_i} + θ_{t'→t} + µ_{t'}(i-1) ]
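A minimal Viterbi sketch over additive CRF weights, following the recurrence above. theta_trans[(t_prev, t)] and theta_emit[(t, w)] are illustrative dictionaries; missing weights default to 0 here, and the start symbol is an assumption.

    def viterbi(words, tagset, theta_trans, theta_emit, start="<s>"):
        # initialize with the start transition plus the first emission weight
        mu = {t: theta_trans.get((start, t), 0.0) + theta_emit.get((t, words[0]), 0.0)
              for t in tagset}
        back = []
        for w in words[1:]:
            new_mu, pointers = {}, {}
            for t in tagset:
                best_prev = max(tagset,
                                key=lambda p: mu[p] + theta_trans.get((p, t), 0.0))
                new_mu[t] = (mu[best_prev] + theta_trans.get((best_prev, t), 0.0)
                             + theta_emit.get((t, w), 0.0))
                pointers[t] = best_prev
            mu, back = new_mu, back + [pointers]
        # follow back-pointers from the best final tag
        tags = [max(tagset, key=lambda t: mu[t])]
        for pointers in reversed(back):
            tags.append(pointers[tags[-1]])
        return list(reversed(tags))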

12 Estimation
Using the gradient: as in the max-entropy model, the gradient is the difference of:
The count of feature f in the true labelings (from the numerator).
The expected count of feature f (from the denominator).
The numerator term is easy (just count!); the denominator term is harder:

    E_{P(·;Θ)}[f_{t'→t}(T)] = Σ_{W ∈ D} Σ_{T} P(T | W; Θ) #(f_{t'→t}(T))

Sum over all possible tag sequences of the probability times whether the feature is active.
This can be computed using forward-backward.

13 Forward-backward for CRF
We want to compute: Σ_{W ∈ D} Σ_{T} P(T | W; Θ) #(f_{t'→t}(T))
Basic question: the probability that f_{t'→t} is active at time i.
(Trellis diagram over tags a, b at positions 1, 2, starting from <s>.)
Very similar to tag marginals!

    P(T_i = t, T_{i+1} = t') ∝ α_t(i) θ_{t→t'} θ_{t'→w_{i+1}} β_{t'}(i+1)
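A sketch of turning these edge marginals into expected feature counts for one sentence, assuming forward and backward tables already computed in the exponentiated (potential) domain, with potentials psi = exp(θ); the array names and shapes are illustrative.

    import numpy as np

    def expected_transition_counts(alpha, beta, psi_trans, psi_emit):
        # alpha, beta: (n, K) forward/backward tables over n positions, K tags
        # psi_trans[s, t] = exp(theta_{s->t}); psi_emit[t, i] = exp(theta_{t->w_i})
        n, K = alpha.shape
        Z = alpha[-1].sum()                       # partition function
        expected = np.zeros((K, K))
        for i in range(n - 1):
            # edge marginal P(T_i = s, T_{i+1} = t | W) for every tag pair (s, t)
            edge = (alpha[i][:, None] * psi_trans * psi_emit[:, i + 1][None, :]
                    * beta[i + 1][None, :]) / Z
            expected += edge
        return expected   # expected[s, t] = E[#(f_{s->t})] for this sentence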

14 Feature design
Consequence: we can compute the gradient easily for:
Features which look at a single T_i in T_{1:n}, and any words W (since the words don't depend on T).
Features which look at T_i, T_{i+1}, and any words W.
It is hard(er) to compute for:
Features which look at more than two tags.
Features for non-adjacent tags.
These require altering the dynamic program somehow.
Features which only look at W and not T don't do anything, since we only care about P(T | W).

15 CRF training can be slow
To compute the gradient, we have to run forward-backward.
CRF training can be as slow as your HMM program times your max-ent program.
Using someone else's software is highly recommended.

16 CRFSuite
Recommended software for the project.
Like Megam, you transform your data into a feature file and it learns the weights.
Data format:

    LABEL FEAT FEAT...
    LABEL FEAT FEAT...
    LABEL FEAT FEAT...
                          <----- a blank line marks the end of a sentence

CRFSuite automatically computes θ values for:
All θ_{t'→t} (based on the labels you use).
All θ_{t→f} (based on the features you use).
There is no way to specify other feature types, such as (t', t, w), in this software.
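As a rough illustration of producing data in this shape (the feature names are just examples, and the tab separator is my assumption; check the tool's documentation for the exact format it expects), one might write:

    def write_training_file(sentences, path):
        # sentences: list of sentences, each a list of (word, label) pairs
        with open(path, "w") as out:
            for sentence in sentences:
                for word, label in sentence:
                    feats = ["w=" + word.lower(), "suf3=" + word[-3:]]
                    if word[0].isupper():
                        feats.append("cap")
                    out.write(label + "\t" + "\t".join(feats) + "\n")
                out.write("\n")               # blank line ends the sentence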

17 Using CRFs as chunkers
Typical problem: find and label subsequences of text.
Named entity recognition:
Pick out all strings referring to a specific individual in the world.
Identify the entity type (person/organization/place/date...).

    [PERS Mary] works for [ORG The Ohio State University]

This is sequence labeling, because the labeling of name words is affected by context:

    [PLACE Ohio], and a university

18 Standard problem setup
Four tags per NE label: BEGIN-ORG, IN-ORG, LAST-ORG, UNIT-ORG (one word), plus a neutral OUTSIDE tag.

    Students/OUT at/OUT Ohio/BEGIN-ORG State/IN-ORG University/LAST-ORG are/OUT studying/OUT
    Students/OUT at/OUT OSU/UNIT-ORG are/OUT studying/OUT

(Sometimes called BILOU coding.)
We could just use ORG and OUT, but that is not as good.
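A small sketch of producing this encoding from labeled spans; the span representation (token offsets with exclusive end) and the function name are my own.

    def bilou_encode(tokens, spans):
        # spans: list of (start, end, label), end exclusive
        tags = ["OUT"] * len(tokens)
        for start, end, label in spans:
            if end - start == 1:
                tags[start] = "UNIT-" + label
            else:
                tags[start] = "BEGIN-" + label
                tags[end - 1] = "LAST-" + label
                for i in range(start + 1, end - 1):
                    tags[i] = "IN-" + label
        return tags

    # bilou_encode("Students at Ohio State University are studying".split(),
    #              [(2, 5, "ORG")])
    # -> ['OUT', 'OUT', 'BEGIN-ORG', 'IN-ORG', 'LAST-ORG', 'OUT', 'OUT']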

19 Relation extraction
Extracting Relation Descriptors with Conditional Random Fields, Li, Jiang, Chieu, Chai 2011.
Task: extract entity pairs that have some relationship.
Employment: said ARG-1, a vice president at ARG-2.
Personal: ARG-1 later married ARG-2.
Basic solution: treat as sequence labeling with tags ARG-1-EMPLOYMENT, ARG-2-EMPLOYMENT, ARG-1-PERSONAL, etc., using a BILOU-like coding scheme.

20 Issues
The task has long-distance dependencies (only one ARG-1 and ARG-2 per relation per sentence, etc.), so the Markov property is violated.
They modify the dynamic program by adding some long-distance features, e.g., the sequence of words between ARG tags.
This gives an exponential slowdown, but this is research, I guess.
The standard CRF scores 73% F-score; the modified CRF gets up to 80%.
No timing figures in the paper...

21 Hindi NER
Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction, Li and McCallum 2003.
Find all PERSON, LOCATION, ORGANIZATION mentions in Hindi news.
Throw in the word, characters from the word, prefix/suffix, and lists of Hindi NEs.
Use a stepwise feature selection procedure:
Add a batch of features.
Calculate the approximate gain in LL from each one.
Keep the good ones. Repeat.
Search over features and conjunctions of features.
70% F-score.

22 Transliterating Chinese
Forward-backward Machine Transliteration between English and Chinese Based on Combined CRFs, Qin and Chen 2011, Workshop on Named Entities.
They use two CRFs:
The first finds chunks of characters that represent a sound (BILOU-like encoding).
The second labels each chunk with some foreign characters.
The results aren't great:
En->Zh: 70% of characters correct.
Zh->En: 76% of characters correct.
The whole name is right about 30% of the time.