CRF for human beings
Arne Skjærholt
LNS seminar (1 / 29)
Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field in case, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbours in $G$.
Outline
1. Graphical models: directed, undirected
2. CRFs: parameter estimation, inference
3. Practicalities: training, experiments, constrained decoding
What is this thing called graphical models?
- A set of random variables X
- A graph describing the dependencies in X
Conditional independence
- Key concept in graphical models
- Marginal independence: $p(x \mid Y) = p(x)$
- $X$ and $Y$ are conditionally independent given $Z$ iff $p(x \mid Y, Z) = p(x \mid Z)$.
Directed model, directed graph
- Each node is influenced by its parent nodes
- Any node is conditionally independent of the rest of the graph, given its parents
- $p(\mathbf{x}) = \prod_{X \in \mathbf{X}} p(X \mid X_\pi)$, where $X_\pi$ are the parents of $X$
[figure: example directed graph with nodes A, S, B]
HMM
[figure: chain $q_s \to q_1 \to q_2 \to q_3 \to q_4 \to q_e$, each $q_i$ emitting $w_i$]

$p(q, w) = p(q_e \mid q_T) \prod_{i=1}^{T} p(q_i \mid q_{i-1})\, p(w_i \mid q_i)$
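The HMM factorisation is easy to sketch directly. A minimal, hypothetical example with hand-picked transition and emission tables (the names and toy Latin tokens are mine, not the tagger used in the experiments):

```python
import math

def hmm_log_prob(states, words, trans, emit, start="<s>", stop="</s>"):
    """log p(q, w) = log p(q_e|q_T) + sum_i [log p(q_i|q_{i-1}) + log p(w_i|q_i)]."""
    logp = 0.0
    prev = start
    for q, w in zip(states, words):
        logp += math.log(trans[(prev, q)]) + math.log(emit[(q, w)])
        prev = q
    return logp + math.log(trans[(prev, stop)])

# Toy tables: a single deterministic state path, emission probability 0.5 each,
# so p(q, w) = 1 * 0.5 * 1 * 0.5 * 1 = 0.25.
trans = {("<s>", "N"): 1.0, ("N", "V"): 1.0, ("V", "</s>"): 1.0}
emit = {("N", "canis"): 0.5, ("V", "currit"): 0.5}
p = math.exp(hmm_log_prob(["N", "V"], ["canis", "currit"], trans, emit))  # 0.25
```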
MEMM
[figure: the same chain $q_s \to q_1 \to \cdots \to q_e$ over words $w_1, \ldots, w_4$]

$p(q \mid w) = p(q_e \mid q_T) \prod_{i=1}^{T} p(q_i \mid q_{i-1}, w_i)$
Undirected model, undirected graph
- Neighbouring nodes influence each other
- Any node is conditionally independent of the rest of the graph, given its neighbours
- Impossible to model probabilities over each variable separately
[figure: example undirected graph with nodes A, B, C, D, E]
Potential functions
- Defined over each maximal clique in the graph
- Strictly positive, real-valued functions

$\Psi(X) = \prod_{c \in C(X)} \Psi_c(c)$
Normalisation
- Probability is the normalised potential: $p(x) = \frac{1}{Z(X)} \Psi(x)$
- Normalisation is the sum of all possible potentials: $Z(X) = \sum_{x \in \Omega(X)} \Psi(x)$
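The partition function can be computed by brute force on small graphs, which makes the definition concrete. A minimal sketch (function and variable names are my own):

```python
from itertools import product

def partition(labels, length, potential):
    """Z: sum of Psi(x) over every assignment x in Omega(X)."""
    return sum(potential(x) for x in product(labels, repeat=length))

# With a uniform potential of 1, Z simply counts the assignments: 2^3 = 8.
Z = partition(["A", "B"], 3, lambda x: 1.0)
```

The exponential blow-up of this sum over all assignments is exactly why the later slides need forward-backward.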
CRFs, take two
A CRF is an undirected graphical model such that, when conditioned on $X$, the Markov property holds for all nodes in $Y$: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$. Here we additionally assume that the nodes of $Y$ form a tree.
Linear-chain CRFs
- Simplest possible tree? A linear chain.
- Augment the chain with start and stop nodes
[figure: chain $Y_0, Y_1, \ldots, Y_6$ with inputs $X_1, \ldots, X_5$]
But what's the probability?
- We already know how to compute the $p$ of a full graph, but what about the discriminative $p$ needed for a CRF?
- The Hammersley-Clifford theorem:

$p(X, Y)/p(X) = \exp \sum_{c \in C(Y)} Q(c)$
But what's the probability? (part deux)
- Since our CRFs are tree-structured, the biggest possible clique is two neighbouring nodes

$p(Y \mid X) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, Y|_e, X) + \sum_{v \in V,\, k} \mu_k g_k(v, Y|_v, X)\Big)$
Linear-chain probability
- Natural ordering of the edges
- Sum the potentials from left to right

$p(Y \mid X) \propto \exp\Big(\sum_{i=1}^{T} \sum_j \theta_j f_j(Y_{i-1}, Y_i, i, X)\Big) = \exp\Big(\sum_j \theta_j F_j(X, Y)\Big)$
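The unnormalised linear-chain potential is just a double sum. A sketch with a single toy feature function (all names here are hypothetical, for illustration only):

```python
import math

def chain_potential(theta, features, y, x):
    """exp(sum_{i=1}^T sum_j theta_j * f_j(y_{i-1}, y_i, i, x)); y[0] is the start label."""
    total = 0.0
    for i in range(1, len(y)):
        for j, f in enumerate(features):
            total += theta[j] * f(y[i - 1], y[i], i, x)
    return math.exp(total)

# One feature that fires on every transition; two transitions, weight 1, so exp(2).
f0 = lambda prev, cur, i, x: 1.0
score = chain_potential([1.0], [f0], ["start", "N", "V"], ["canis", "currit"])
```

Dividing this by the sum of `chain_potential` over all label sequences gives the actual probability.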
Likelihood
- Maximum likelihood estimation
- Log-likelihood: $\mathcal{L}(\theta) = \sum_k \log p_\theta(y^{(k)} \mid x^{(k)})$
- Continuous, concave function: solve $\nabla \mathcal{L} = 0$ for the maximum
- Enter the scary formulas
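The "scary formulas" are presumably the gradient components; a hedged sketch in the notation of the linear-chain slide, with each component being the observed feature count minus its model expectation (regularisation omitted):

```latex
\frac{\partial \mathcal{L}}{\partial \theta_j}
  = \sum_k \Big( F_j(x^{(k)}, y^{(k)})
      - E_{p_\theta(Y \mid x^{(k)})}\big[ F_j(x^{(k)}, Y) \big] \Big)
```

The expectation term is what the forward-backward slides below compute.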
Regularisation
- MLE tends to overfit the training data
- Log-linear models are smoothed with penalty terms in the likelihood
- Lots of ways to do it; wapiti uses the elastic net, a combination of Gaussian and Laplacian priors

$\mathcal{L}' = \mathcal{L} - \rho_1 \sum_j |\theta_j| - \frac{\rho_2}{2} \sum_j \theta_j^2$
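The penalty itself is trivial to compute. A minimal sketch (names are my own, not wapiti's API):

```python
def elastic_net(theta, rho1, rho2):
    """rho1 * sum_j |theta_j|  +  (rho2 / 2) * sum_j theta_j^2."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return rho1 * l1 + 0.5 * rho2 * l2

# |1| + |-2| = 3 and (1 + 4) / 2 = 2.5, so with rho1 = rho2 = 1 the penalty is 5.5.
penalty = elastic_net([1.0, -2.0], 1.0, 1.0)
```

Setting $\rho_1 = 0$ recovers a pure Gaussian prior, $\rho_2 = 0$ a pure Laplacian one.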
But what's the best labelling?
- Given a linear-chain CRF, inference is easy: Viterbi
- At each time step, a matrix: $M_t(q, q' \mid x) = \exp\big(\sum_j \theta_j f_j(q, q', t, x)\big)$
- Normalisation factor: $Z(x) = \big(\prod_{t=1}^{T} M_t(x)\big)_{\text{start,stop}}$
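A log-space Viterbi sketch over a generic transition-score function standing in for $\log M_t$; this is an illustration under my own naming, not wapiti's implementation:

```python
def viterbi(labels, T, log_m):
    """Best label sequence of length T; log_m(t, prev, cur) is the log score
    of moving to label cur at position t (prev == "start" at t = 1)."""
    delta = {q: log_m(1, "start", q) for q in labels}   # best score ending in q
    backptrs = []
    for t in range(2, T + 1):
        new, ptr = {}, {}
        for q in labels:
            best = max(labels, key=lambda p: delta[p] + log_m(t, p, q))
            new[q], ptr[q] = delta[best] + log_m(t, best, q), best
        delta, backptrs = new, backptrs + [ptr]
    q = max(delta, key=delta.get)                        # best final label
    path = [q]
    for ptr in reversed(backptrs):                       # follow backpointers
        q = ptr[q]
        path.append(q)
    return path[::-1]

# Toy scores that reward starting on "A" and then alternating labels.
def log_m(t, p, q):
    if p == "start":
        return 2.0 if q == "A" else 0.0
    return 2.0 if p != q else 0.0

best = viterbi(["A", "B"], 3, log_m)  # ["A", "B", "A"]
```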
Back & forth
- Turns out parameter estimation needs decoding as well (kinda)
- $E_{p(y \mid x^{(k)})}[F_j]$ requires probabilities for all possible labellings
- The forward-backward algorithm rescues us:

$\sum_{y \in Q^T} p(Y = y \mid x^{(k)})\, F_j(x^{(k)}, y) = \sum_{i=1}^{T} \sum_{(q, q') \in Q^2} p(Y_{i-1} = q, Y_i = q' \mid x)\, f_j(q, q', x, i)$
Back & forth again

$\alpha_0(q \mid x) = \begin{cases} 1 & \text{if } q \text{ is start} \\ 0 & \text{otherwise} \end{cases}$ $\qquad \alpha_t(x) = \alpha_{t-1}(x)\, M_t(x)$

$\beta_{T+1}(q \mid x) = \begin{cases} 1 & \text{if } q \text{ is stop} \\ 0 & \text{otherwise} \end{cases}$ $\qquad \beta_t(x) = M_{t+1}(x)\, \beta_{t+1}(x)$

$p(Y_{t-1} = q, Y_t = q' \mid x) = \dfrac{\alpha_{t-1}(q \mid x)\, M_t(q, q' \mid x)\, \beta_t(q' \mid x)}{Z(x)}$
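The recursions can be sketched over a potential function `m(t, q, q')` standing in for $M_t(q, q' \mid x)$; unnormalised potentials and hypothetical names, for illustration only:

```python
def forward_backward(labels, T, m, start="start", stop="stop"):
    """alpha[t-1][q]: total potential of prefixes ending in label q at position t;
    beta[t-1][q]: total potential of suffixes from label q at position t."""
    alpha = [{q: m(1, start, q) for q in labels}]
    for t in range(2, T + 1):
        alpha.append({q: sum(alpha[-1][p] * m(t, p, q) for p in labels)
                      for q in labels})
    beta = [{q: m(T + 1, q, stop) for q in labels}]
    for t in range(T, 1, -1):
        beta.insert(0, {q: sum(m(t, q, p) * beta[0][p] for p in labels)
                        for q in labels})
    # Z can be read off at either end of the chain.
    Z = sum(alpha[-1][q] * beta[-1][q] for q in labels)
    # Edge marginal: p(Y_{t-1}=q, Y_t=q') = alpha[t-2][q] * m(t, q, q') * beta[t-1][q'] / Z
    return alpha, beta, Z

# With all potentials equal to 1 and two labels, Z counts the 2^2 = 4 label paths.
alpha, beta, Z = forward_backward(["A", "B"], 2, lambda t, p, q: 1.0)
```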
Training models (and not dying of old age)
- Parameter estimation is expensive
- wapiti supports several algorithms: L-BFGS, SGD, BCD, Rprop
- Best approach: combine several algorithms
Features
- Two kinds of features: unigram and bigram
- Best features for my MSD experiments:
  - Word surface form
  - Trivial unigram and bigram features
  - Fixed-length suffixes, lengths 1-10
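The suffix extraction described above can be sketched as follows; the feature-string format is my own invention, not wapiti's pattern syntax:

```python
def token_features(word, max_suffix=10):
    """Surface form plus every fixed-length suffix up to max_suffix characters."""
    feats = ["w=" + word]
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats.append("suf{}={}".format(n, word[-n:]))
    return feats

feats = token_features("amat")
# ["w=amat", "suf1=t", "suf2=at", "suf3=mat", "suf4=amat"]
```

Suffixes matter for Latin because the MSD label is largely carried by inflectional endings.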
The corpus
- PROIEL Latin corpus
- Small corpus (Bellum Gallicum): 25,000 tokens / 1,300 sentences / 350 labels
- Big corpus (Vulgata): 113,000 tokens / 12,500 sentences / 550 labels
Results

Experiment     TE      SE      OOV     IV
HMM BG         15.7%   86.5%   39.3%   11.1%
CRF BG         16.4%   88.4%   37.2%   12.0%
HMM Vulgata     9.98%  48.6%   35.7%    7.97%
CRF Vulgata    10.3%   49.5%   34.5%    8.22%
Theory
- For each input token, a function that returns the licensed labels for that token
- Find the best label sequence using only licensed labels
- Conceptually equivalent to walking the n-best list and picking the first sequence licensed by the constraints
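Constrained decoding drops straight into Viterbi: at each position, only the labels the constraint function licenses are considered. A hypothetical sketch (my own names, not wapiti's constrained-decoding interface):

```python
def constrained_viterbi(tokens, log_m, licensed):
    """Viterbi where position t only considers licensed(tokens[t-1])."""
    delta = {q: log_m(1, "start", q) for q in licensed(tokens[0])}
    backptrs = []
    for t in range(2, len(tokens) + 1):
        new, ptr = {}, {}
        for q in licensed(tokens[t - 1]):
            best = max(delta, key=lambda p: delta[p] + log_m(t, p, q))
            new[q], ptr[q] = delta[best] + log_m(t, best, q), best
        delta, backptrs = new, backptrs + [ptr]
    q = max(delta, key=delta.get)
    path = [q]
    for ptr in reversed(backptrs):
        q = ptr[q]
        path.append(q)
    return path[::-1]

# A toy model that always prefers "A"; the constraint forces "B" on the last token.
log_m = lambda t, p, q: 2.0 if q == "A" else 0.0
licensed = lambda tok: ["B"] if tok == "currit" else ["A", "B"]
tags = constrained_viterbi(["canis", "currit"], log_m, licensed)  # ["A", "B"]
```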
The bad news

Experiment            TE      SE      OOV     IV
CRF BG                16.4%   88.4%   37.2%   12.0%
CRF Vulgata           10.3%   49.5%   34.5%    8.22%
Constrained BG        18.0%   89.7%   23.2%   16.9%
Constrained Vulgata    9.87%  49.9%   16.2%    9.33%
The good news

Experiment                    TE      SE      OOV     IV
HMM BG on Vulgata             37.5%   92.9%   67.1%   15.2%
HMM Vulgata on BG             30.3%   96.9%   52.1%   17.6%
HMM Mark & Matthew            39.1%   99.1%   58.4%   18.5%
Constrained BG on Vulgata     25.8%   82.9%   39.5%   14.7%
Constrained Vulgata on BG     27.0%   96.6%   34.8%   26.7%
Constrained Mark & Matthew    28.9%   97.5%   37.0%   28.6%
This and that
- Layered CRFs did not work well
- Dynamic CRFs look interesting

Questions?