Parameter learning in CRFs — June 01, 2009
Structured output learning

We wish to learn a discriminant (or compatibility) function:

    F : X \times Y \to R    (1)

where X is the space of inputs and Y is the space of outputs. The functions F considered are of the form:

    F(x, y) = \theta^T \phi(x, y)    (2)

The parameter \theta is the parameter to be learned. It is learned by optimizing a certain Regularized Risk functional over the training data.
Notation

Setting the notation:
- x denotes an element of the input space X; y denotes an element of the output space Y.
- Y = Y_1 \times Y_2 \times \cdots \times Y_L, where each Y_l = {1, 2, ..., q}, so |Y| = q^L.
- \phi(x, y) is a feature map on the joint space of inputs and outputs.
- Training data: D = {(x_n, y_n)}_{n=1}^N; n is used to index over the training data.
Probabilistic model

The compatibility function F(x, y) = \theta^T \phi(x, y) can be given a probabilistic interpretation as:

    p(y | x, \theta) = \frac{\exp(\theta^T \phi(x, y))}{Z(x, \theta)}    (3)

where

    Z(x, \theta) = \sum_y \exp(\theta^T \phi(x, y))    (4)

is the partition function.
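The definitions in (3) and (4) can be checked directly on a tiny problem. A minimal sketch, computing the partition function by brute-force enumeration of all q^L labelings (the names `phi`, `L`, `q` are illustrative, not from the notes; this is only feasible for very small L):

```python
import itertools
import numpy as np

def log_partition_brute(phi, x, theta, L, q):
    """log Z(x, theta) = log sum_y exp(theta^T phi(x, y)),
    by enumerating all q^L labelings. Tractable only for tiny L."""
    scores = [theta @ phi(x, y) for y in itertools.product(range(q), repeat=L)]
    m = max(scores)
    # log-sum-exp with the max subtracted for numerical stability
    return m + np.log(sum(np.exp(s - m) for s in scores))

def conditional_prob(phi, x, y, theta, L, q):
    """p(y | x, theta) = exp(theta^T phi(x, y)) / Z(x, theta), as in (3)."""
    return np.exp(theta @ phi(x, y) - log_partition_brute(phi, x, theta, L, q))
```

By construction the probabilities sum to one over all q^L labelings, which is a useful sanity check for any feature map `phi`.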
ML learning

Maximum likelihood learning involves maximizing the conditional likelihood of the training data:

    \theta^* = \arg\max_\theta \prod_{n=1}^N p(y_n | x_n, \theta)    (5)

equivalently written as:

    \theta^* = \arg\max_\theta \sum_{n=1}^N \log p(y_n | x_n, \theta)    (6)

    \theta^* = \arg\max_\theta \sum_{n=1}^N [\theta^T \phi(x_n, y_n) - \log Z(x_n, \theta)]    (7)

ML learning is intractable in general because of the partition function evaluation, a sum over all q^L labelings.
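The objective in (7) and its gradient can be sketched for small problems. The gradient of the negative log-likelihood has the standard "expected features minus observed features" form; the sketch below computes both by brute-force enumeration, which is exactly the part that is intractable in general (the names `phi`, `data`, `L`, `q` are illustrative):

```python
import itertools
import numpy as np

def nll_and_grad(phi, data, theta, L, q):
    """Negative log-likelihood  sum_n [log Z(x_n) - theta^T phi(x_n, y_n)]
    and its gradient  sum_n [E_{p(y|x_n)}[phi(x_n, y)] - phi(x_n, y_n)],
    both by brute-force enumeration over the q^L labelings."""
    nll, grad = 0.0, np.zeros_like(theta)
    for x, y_obs in data:
        ys = list(itertools.product(range(q), repeat=L))
        feats = np.array([phi(x, y) for y in ys])   # (q^L, d)
        scores = feats @ theta                      # theta^T phi for every y
        m = scores.max()
        logZ = m + np.log(np.exp(scores - m).sum())
        p = np.exp(scores - logZ)                   # p(y | x, theta)
        nll += logZ - theta @ phi(x, y_obs)
        grad += p @ feats - phi(x, y_obs)           # E[phi] - phi(y_obs)
    return nll, grad
```

Since the objective is convex in \theta, this loss/gradient pair can be handed to any gradient-based optimizer.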
CRFs

CRFs induce a factorization of the probability p(y_n | x_n, \theta), which can make the partition function computation tractable:

    p(y_n | x_n, \theta) \propto \exp\left( \sum_{t=1}^T \theta_t^T \phi_t(x_n, y_{nt}) \right)    (8)

The index t is used to index over the cliques in the above equation, and y_{nt} denotes the restriction of y_n to clique t.
ML

For certain factorizations (like chains or trees), the partition function can be computed efficiently and exactly (using the sum-product algorithm). Thus ML learning is feasible for such CRFs. Recalling the ML maximization criterion:

    \theta^* = \arg\max_\theta \sum_{n=1}^N [\theta^T \phi(x_n, y_n) - \log Z(x_n, \theta)]    (9)

    = \arg\min_\theta \sum_{n=1}^N [\log Z(x_n, \theta) - \theta^T \phi(x_n, y_n)]    (10)

where, under the CRF factorization,

    \theta^T \phi(x_n, y_n) = \sum_{t=1}^T \theta_t^T \phi_t(x_n, y_{nt})    (11)

    Z(x_n, \theta) = \sum_{y_{n1}} \sum_{y_{n2}} \cdots \sum_{y_{nL}} \exp\left( \sum_{t=1}^T \theta_t^T \phi_t(x_n, y_{nt}) \right)    (12)
MAP learning in CRFs

MAP learning just adds a prior to the ML learning and is written as:

    \theta^* = \arg\min_\theta \left\{ \|\theta\|^2 + C \sum_{n=1}^N [\log Z(x_n, \theta) - \theta^T \phi(x_n, y_n)] \right\}    (13)

MAP learning can be interpreted as Regularized Risk Minimization:

    \theta^* = \arg\min_\theta \left\{ \|\theta\|^2 + C \sum_{n=1}^N L(x_n, y_n, \theta) \right\}    (14)

    L(x_n, y_n, \theta) = \log Z(x_n, \theta) - \theta^T \phi(x_n, y_n)    (15)
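The regularizer in (13) arises from the prior. A short derivation, assuming a Gaussian prior p(\theta) \propto \exp(-\lambda \|\theta\|^2) (the constant \lambda is introduced here for illustration):

```latex
\theta^* = \arg\max_\theta \; p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \theta)
         = \arg\max_\theta \; \log p(\theta) + \sum_{n=1}^N \log p(y_n \mid x_n, \theta)
         = \arg\min_\theta \; \lambda \|\theta\|^2
           + \sum_{n=1}^N \big[ \log Z(x_n, \theta) - \theta^T \phi(x_n, y_n) \big]
```

Dividing the objective through by \lambda and setting C = 1/\lambda recovers (13), so C trades off the prior against the data fit.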
Kernel CRFs

The above formulation of ML/MAP learning can be kernelized using the representer theorem; refer to (12.9) and (12.10) of Chapter 12 of the book (Predicting Structured Data). The CRF factorization helps in keeping the kernelized formulation tractable.
Learning CRFs: SVMs

Learning using ML estimation is one way of learning CRFs. ML estimation can be thought of as a specific loss function over which the Empirical Risk is being minimized; other loss functions give different optimization algorithms. Recalling the margin-rescaled SVM formulation:

    \min_\theta \left\{ \frac{1}{2}\|\theta\|^2 + C \sum_{n=1}^N \left[ \max_{y \ne y_n} \left( \Delta(y_n, y) + F(x_n, y) - F(x_n, y_n) \right) \right] \right\}    (16)

where:

    F(x_n, y_n) = \sum_{t=1}^T \theta_t^T \phi_t(x_n, y_{nt})    (17)
Learning CRFs: SVMs...

The margin-rescaled SVM formulation can also be written as:

    \min_{\theta, \xi_n} \left\{ \frac{1}{2}\|\theta\|^2 + C \sum_{n=1}^N \xi_n \right\}    (18)

subject to:

    F(x_n, y_n) - F(x_n, y) \ge \Delta(y_n, y) - \xi_n \quad \forall y, n    (19)

This formulation has exponentially many constraints, but can be solved in polynomial time using the cutting-plane algorithm discussed last time. That algorithm doesn't take any advantage of the factorization provided by the CRF; (12.13)-(12.16) of the book provide an alternative way of optimizing (18) which uses the CRF factorization to achieve tractability.
Learning CRFs: CGMs

Conditional Graphical Models (CGMs) are an alternative way of training CRFs. They make use of the CRF factorization to upper-bound the Empirical Risk by a sum of functions which decomposes per clique.