CRF for human beings
Arne Skjærholt
LNS seminar

Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field in case, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.

Outline
1 Graphical models: directed, undirected
2 CRFs: parameter estimation, inference
3 Practicalities: training, experiments, constrained decoding

What is this thing called graphical models?
A set of random variables X
A graph, describing the dependencies in X

Conditional independence
Key concept in graphical models
Marginal independence: $p(X \mid Y) = p(X)$
X and Y are conditionally independent given Z iff $p(X \mid Y, Z) = p(X \mid Z)$.

Directed model, directed graph
Each node is influenced by its parent nodes
Any node is conditionally independent of the rest of the graph, given its parents
$p(\mathbf{X}) = \prod_{X \in \mathbf{X}} p(X \mid X_\pi)$
(figure: a small directed graph over the nodes A, S, B)

HMM
(figure: a chain of states $q_s, q_1, q_2, q_3, q_4, q_e$ emitting words $w_1, w_2, w_3, w_4$)
$p(Q, W) = p(q_e \mid q_T) \prod_{i=1}^{T} p(q_i \mid q_{i-1})\, p(w_i \mid q_i)$
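To make the factorisation concrete, here is a minimal sketch that evaluates the HMM joint probability; the tags, words and probability values are invented for illustration, not taken from the experiments.

```python
# Toy HMM parameters (invented): transition p(q_i | q_{i-1}) and emission p(w_i | q_i).
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "</s>"): 0.5}
emit = {("DT", "the"): 0.7, ("NN", "cat"): 0.1}

def hmm_joint(tags, words):
    """p(Q, W) = p(q_e | q_T) * prod_i p(q_i | q_{i-1}) p(w_i | q_i); tags exclude start/stop."""
    p = 1.0
    prev = "<s>"
    for q, w in zip(tags, words):
        p *= trans.get((prev, q), 0.0) * emit.get((q, w), 0.0)
        prev = q
    return p * trans.get((prev, "</s>"), 0.0)

print(hmm_joint(["DT", "NN"], ["the", "cat"]))  # 0.6*0.7 * 0.8*0.1 * 0.5
```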

MEMM
(figure: the same chain of states $q_s, q_1, \ldots, q_e$, but with the words $w_1, \ldots, w_4$ conditioning the state transitions)
$p(Q \mid W) = p(q_e \mid q_T) \prod_{i=1}^{T} p(q_i \mid q_{i-1}, w_i)$

Undirected model, undirected graph
Neighbouring nodes influence each other
Any node is conditionally independent of the rest of the graph, given its neighbours
Impossible to model the distribution as per-variable conditional probabilities
(figure: a small undirected graph over the nodes A, B, C, D, E)

Potential functions
Defined over each maximal clique in the graph
Strictly positive, real-valued functions
$\Psi(\mathbf{X}) = \prod_{c \in C(\mathbf{X})} \Psi_c(c)$
(figure: the same undirected graph over A, B, C, D, E)

Normalisation
Probability is the normalised potential:
$p(\mathbf{X}) = \frac{1}{Z(\mathbf{X})} \Psi(\mathbf{X})$
The normalisation is the sum of the potential over all possible configurations:
$Z(\mathbf{X}) = \sum_{x \in \Omega(\mathbf{X})} \Psi(x)$
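To make potentials and the partition function concrete, here is a minimal brute-force sketch; the three binary variables, the cliques and the potential values are invented for illustration.

```python
from itertools import product

# Hypothetical example: binary variables A, B, C with maximal cliques {A,B} and {B,C}.
def psi_ab(a, b):
    return 2.0 if a == b else 0.5   # favours agreement between A and B

def psi_bc(b, c):
    return 3.0 if b == c else 1.0   # favours agreement between B and C

def potential(a, b, c):
    # Unnormalised potential: product over the maximal cliques.
    return psi_ab(a, b) * psi_bc(b, c)

# Partition function: sum of the potential over every configuration.
Z = sum(potential(a, b, c) for a, b, c in product([0, 1], repeat=3))

def prob(a, b, c):
    # Normalised probability of one configuration.
    return potential(a, b, c) / Z

print(Z, prob(1, 1, 1))
```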

CRFs, take two
A CRF is an undirected graphical model such that, when conditioned on X, the following property holds for all nodes in Y: for a pair of variables $(Y_v, Y_w)$ there exists a single neighbour $Y_w$ of $Y_v$ such that $p(Y_v \mid X, Y_{w'}, w' \neq v) = p(Y_v \mid X, Y_w)$; that is, the nodes of Y must form a tree.

Linear-chain CRFs
Simplest possible tree? A linear chain.
Augment the chain with start and stop nodes
(figure: label chain $Y_0, Y_1, \ldots, Y_6$ over observations $X_1, \ldots, X_5$)

But what's the probability?
We already know how to compute the p of a full graph, but what about the discriminative p needed for a CRF?
The Hammersley-Clifford theorem:
$\frac{p(X, Y)}{p(X)} = \exp\Big(\sum_{c \in C(Y)} Q(c)\Big)$

But what's the probability? (part deux)
Since CRFs are tree-structured, the biggest possible clique is two neighbouring nodes
$p(Y \mid X) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, Y_e, X) + \sum_{v \in V,\, k} \mu_k g_k(v, Y_v, X)\Big)$

Linear-chain probability
Natural ordering of the edges
Sum the potentials from left to right
$p(Y \mid X) \propto \exp\Big(\sum_{i=1}^{T} \sum_j \theta_j f_j(Y_{i-1}, Y_i, i, X)\Big) = \exp\Big(\sum_j \theta_j F_j(X, Y)\Big)$
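As a hypothetical illustration of the global feature sum $\sum_j \theta_j F_j(X, Y)$, the sketch below scores a label sequence under a linear-chain model; the two feature functions and their weights are invented for the example.

```python
import math

# Hypothetical binary feature functions f_j(y_prev, y, pos, x), invented for illustration.
def f_dt_then_nn(y_prev, y, pos, x):
    return 1.0 if (y_prev, y) == ("DT", "NN") else 0.0

def f_the_tagged_dt(y_prev, y, pos, x):
    return 1.0 if x[pos].lower() == "the" and y == "DT" else 0.0

features = [f_dt_then_nn, f_the_tagged_dt]
theta = [1.5, 2.0]  # one weight theta_j per feature function

def unnormalised_score(labels, x):
    """exp( sum_i sum_j theta_j f_j(Y_{i-1}, Y_i, i, X) ); labels[0] is the start label Y_0."""
    total = 0.0
    for i in range(1, len(labels)):
        for theta_j, f_j in zip(theta, features):
            total += theta_j * f_j(labels[i - 1], labels[i], i - 1, x)
    return math.exp(total)

x = ["the", "cat"]
print(unnormalised_score(["<s>", "DT", "NN"], x))   # high score
print(unnormalised_score(["<s>", "NN", "DT"], x))   # low score
```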

Likelihood
Maximum likelihood estimation
Log-likelihood: $\mathcal{L}(\theta) = \sum_k \log p_\theta(y^{(k)} \mid x^{(k)})$
Continuous, concave function: solve $\nabla \mathcal{L} = 0$ for the maximum
Enter the scary formulas

Regularisation
Using MLE tends to overfit the training data
Smoothing log-linear models is done with penalty terms in the likelihood
Lots of ways to do it; wapiti uses elastic net, a combination of Gaussian and Laplacian priors
$\mathcal{L}' = \mathcal{L} - \rho_1 \sum_j |\theta_j| - \frac{\rho_2}{2} \sum_j \theta_j^2$
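A minimal sketch of the elastic-net penalty applied to a log-likelihood value; the weights and the $\rho$ hyperparameters are invented, and this is not wapiti's implementation.

```python
# Elastic net: subtract an L1 term (Laplacian prior) and an L2 term (Gaussian prior)
# from the log-likelihood. rho1 and rho2 are hypothetical hyperparameters.
def penalised_log_likelihood(log_likelihood, theta, rho1=0.5, rho2=0.1):
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return log_likelihood - rho1 * l1 - (rho2 / 2.0) * l2

print(penalised_log_likelihood(-12.3, [1.5, -2.0, 0.0]))
```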

But what's the best labeling?
Given a linear-chain CRF, inference is easy: Viterbi
At each time step, a matrix: $M_t(q', q \mid x) = \exp\Big(\sum_j \theta_j f_j(q', q, t, x)\Big)$
Normalisation factor: $Z(x) = \Big(\prod_{t=1}^{T+1} M_t(x)\Big)_{\text{start},\text{stop}}$
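The slides only name Viterbi, so here is a minimal sketch of it, assuming the transition matrices $M_t$ are supplied as NumPy arrays in log space and that start and stop are ordinary state indices.

```python
import numpy as np

def viterbi(log_M, start, stop):
    """Best state sequence for a linear-chain model.

    log_M: list of T+1 matrices, log_M[t][q_prev, q] = log M_t(q_prev, q | x),
           covering the transitions start -> y_1, ..., y_T -> stop.
    Returns the best sequence y_1 .. y_T as state indices.
    """
    n_states = log_M[0].shape[0]
    T = len(log_M) - 1
    delta = np.full(n_states, -np.inf)
    delta[start] = 0.0
    backptr = []
    for t in range(T):
        scores = delta[:, None] + log_M[t]      # score of reaching each state via each predecessor
        backptr.append(scores.argmax(axis=0))   # best predecessor per state
        delta = scores.max(axis=0)
    final = delta + log_M[T][:, stop]           # final transition into the stop state
    best = [int(final.argmax())]
    for bp in reversed(backptr):
        best.append(int(bp[best[-1]]))
    best.reverse()
    return best[1:]                             # drop the start state
```

Working in log space turns the product of potentials into a sum and avoids numerical underflow on long chains.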

Back & forth
Turns out, parameter estimation needs decoding as well (kinda)
$\mathbb{E}_{p(y \mid x^{(k)})}[\cdot]$ requires probabilities for all possible labelings
The forward-backward algorithm rescues us:
$\sum_{y \in Q^T} p(Y = y \mid x^{(k)})\, F_j(x^{(k)}, y) = \sum_{i=1}^{T} \sum_{(q', q) \in Q^2} p(Y_{i-1} = q', Y_i = q \mid x)\, f_j(q', q, x, i)$

Back & forth again
$\alpha_0(q \mid x) = \begin{cases} 1 & \text{if } q \text{ is start} \\ 0 & \text{otherwise} \end{cases}$
$\alpha_t(x) = \alpha_{t-1}(x)\, M_t(x)$
$\beta_{T+1}(q \mid x) = \begin{cases} 1 & \text{if } q \text{ is stop} \\ 0 & \text{otherwise} \end{cases}$
$\beta_t(x) = M_{t+1}(x)\, \beta_{t+1}(x)$
$p(Y_{t-1} = q', Y_t = q \mid x) = \dfrac{\alpha_{t-1}(q' \mid x)\, M_t(q', q \mid x)\, \beta_t(q \mid x)}{Z(x)}$
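The recurrences translate almost directly into code; this is a minimal sketch (not the presenter's implementation) that takes the unnormalised matrices $M_t$ and returns $Z(x)$ together with the pairwise marginals used in the expectation above.

```python
import numpy as np

def pairwise_marginals(M, start, stop):
    """Forward-backward for a linear chain.

    M: list of T+1 matrices, M[t][q_prev, q] = M_t(q_prev, q | x) (unnormalised potentials).
    Returns Z(x) and a list P with P[t-1][q_prev, q] = p(Y_{t-1}=q_prev, Y_t=q | x) for t = 1..T.
    """
    n = M[0].shape[0]
    T = len(M) - 1

    # alpha_0 is the indicator of the start state; alpha_t = alpha_{t-1} M_t
    alpha = [np.zeros(n)]
    alpha[0][start] = 1.0
    for t in range(1, T + 2):
        alpha.append(alpha[-1] @ M[t - 1])

    # beta_{T+1} is the indicator of the stop state; beta_t = M_{t+1} beta_{t+1}
    beta = [None] * (T + 2)
    beta[T + 1] = np.zeros(n)
    beta[T + 1][stop] = 1.0
    for t in range(T, -1, -1):
        beta[t] = M[t] @ beta[t + 1]

    Z = alpha[T + 1][stop]            # equivalently (prod_t M_t)[start, stop]
    marginals = []
    for t in range(1, T + 1):
        P = np.outer(alpha[t - 1], beta[t]) * M[t - 1] / Z
        marginals.append(P)
    return Z, marginals
```

For long chains one would rescale $\alpha$ and $\beta$ (or work in log space) to keep the recursions numerically stable.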

Training models (and not dying of old age)
Parameter estimation is expensive
Wapiti supports several algorithms: L-BFGS, SGD, BCD, Rprop
Best approach: combining several algorithms.

Features
Two kinds of features: unigram and bigram
Best features for my MSD experiments:
Word surface form
Trivial unigram and bigram features
Fixed-length suffixes, length 1-10
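As an illustration of that feature set, here is a hypothetical extractor (not the actual wapiti pattern file used for the experiments) producing surface-form and suffix features for one token.

```python
def token_features(tokens, i, max_suffix=10):
    """Surface form plus suffixes of length 1..10 for token i; the feature names are invented."""
    word = tokens[i]
    feats = ["form=" + word]
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats.append("suffix%d=%s" % (n, word[-n:]))
    return feats

print(token_features(["Gallia", "est", "omnis", "divisa"], 0))
```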

The corpus
PROIEL Latin corpus
Small corpus (Bellum Gallicum): 25,000 tokens / 1,300 sentences / 350 labels
Big corpus (Vulgata): 113,000 tokens / 12,500 sentences / 550 labels

Results

Experiment      TE      SE      OOV     IV
HMM BG          15.7%   86.5%   39.3%   11.1%
CRF BG          16.4%   88.4%   37.2%   12.0%
HMM Vulgata      9.98%  48.6%   35.7%    7.97%
CRF Vulgata     10.3%   49.5%   34.5%    8.22%

Constrained decoding: theory
For each input token, a function that returns the licenced labels for that token
Find the best label sequence using only licenced labels
Conceptually equivalent to walking the n-best list and picking the first one licenced by the constraints
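A minimal sketch of the idea, reusing the Viterbi decoder sketched earlier and assuming a hypothetical per-token collection of licenced label indices: labels that are not licenced simply have their scores forced to $-\infty$ before decoding.

```python
import numpy as np

def constrained_viterbi(log_M, start, stop, allowed):
    """Viterbi where, at position i, only the label indices in allowed[i] may be used.

    log_M, start, stop: as in the unconstrained viterbi() sketched earlier.
    allowed: list with one set of licenced label indices per token.
    """
    masked = [m.copy() for m in log_M]
    n = masked[0].shape[0]
    for i, labels in enumerate(allowed):
        mask = np.full(n, -np.inf)
        mask[list(labels)] = 0.0
        masked[i] += mask[None, :]   # block transitions into unlicenced labels at token i
    return viterbi(masked, start, stop)
```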

The bad news

Experiment            TE      SE      OOV     IV
CRF BG                16.4%   88.4%   37.2%   12.0%
CRF Vulgata           10.3%   49.5%   34.5%    8.22%
Constrained BG        18.0%   89.7%   23.2%   16.9%
Constrained Vulgata    9.87%  49.9%   16.2%    9.33%

The good news

Experiment                    TE      SE      OOV     IV
HMM BG on Vulgata             37.5%   92.9%   67.1%   15.2%
HMM Vulgata on BG             30.3%   96.9%   52.1%   17.6%
HMM Mark & Matthew            39.1%   99.1%   58.4%   18.5%
Constrained BG on Vulgata     25.8%   82.9%   39.5%   14.7%
Constrained Vulgata on BG     27.0%   96.6%   34.8%   26.7%
Constrained Mark & Matthew    28.9%   97.5%   37.0%   28.6%

This and that
Layered CRFs did not work well
Dynamic CRFs look interesting
Questions?