Parameter learning in CRFs


Parameter learning in CRFs. June 01, 2009

Structured output learning. We wish to learn a discriminant (or compatibility) function $F : X \times Y \to \mathbb{R}$ (1), where $X$ is the space of inputs and $Y$ is the space of outputs. The functions $F$ are of the form $F(x, y) = w^T \phi(x, y)$ (2). The parameter $w$ is the parameter to be learned; it is learned by optimizing a certain Regularized Risk functional over the training data.
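As a minimal sketch (not from the slides), the following Python snippet scores candidate outputs with a linear compatibility function and predicts by brute-force argmax over a tiny output space. The joint feature map `phi` is an illustrative assumption, not the slides' feature map.

```python
import numpy as np
from itertools import product

def phi(x, y, q):
    """Toy joint feature map phi(x, y): one feature block per position/label (illustrative)."""
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

def F(w, x, y, q):
    """Compatibility function F(x, y) = w^T phi(x, y)  (Eq. 2)."""
    return w @ phi(x, y, q)

def predict(w, x, q):
    """Brute-force argmax over Y = {0,...,q-1}^L; only feasible for very small L."""
    return max(product(range(q), repeat=len(x)), key=lambda y: F(w, x, y, q))

# Usage: random weights and a toy input.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3 * 2)
print(predict(w, x, q=2))
```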

Notation. Setting the notation: $x$ denotes an element of the input space $X$; $y$ denotes an element of the output space $Y$. $Y = Y_1 \times Y_2 \times \dots \times Y_L$, where each $Y_l = \{1, 2, \dots, q\}$, so $|Y| = q^L$. $\phi(x, y)$ is a feature map on the joint space of inputs and outputs. The training data is $D = \{(x_n, y_n)\}_{n=1}^N$; $n$ is used to index over the training data.

Probabilistic model. The compatibility function $F(x, y) = w^T \phi(x, y)$ can be given a probabilistic interpretation: $p(y \mid x, w) = \frac{\exp(w^T \phi(x, y))}{Z(x, w)}$ (3), where $Z(x, w) = \sum_{y} \exp(w^T \phi(x, y))$ (4) is the partition function.
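A brute-force sketch of Eqs. (3)-(4): enumerate all $q^L$ outputs to compute the partition function $Z(x, w)$ and the conditional $p(y \mid x, w)$. The feature map `phi` is the same toy choice as above and is only an illustration; enumeration is workable only for tiny $L$.

```python
import numpy as np
from itertools import product

def phi(x, y, q):
    """Toy joint feature map (illustrative, not from the slides)."""
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

def log_partition(w, x, q):
    """log Z(x, w) = log sum_y exp(w^T phi(x, y)), by brute-force enumeration (Eq. 4)."""
    scores = [w @ phi(x, y, q) for y in product(range(q), repeat=len(x))]
    m = max(scores)
    return m + np.log(sum(np.exp(s - m) for s in scores))   # log-sum-exp for stability

def log_prob(w, x, y, q):
    """log p(y | x, w) = w^T phi(x, y) - log Z(x, w)  (Eq. 3)."""
    return w @ phi(x, y, q) - log_partition(w, x, q)

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=3 * 2)
print(np.exp(log_prob(w, x, (0, 1, 0), q=2)))
```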

ML learning. Maximum likelihood learning involves maximizing the conditional likelihood of the training data: $w^* = \arg\max_w \prod_{n=1}^N p(y_n \mid x_n, w)$ (5), equivalently written as $w^* = \arg\max_w \sum_{n=1}^N \log p(y_n \mid x_n, w)$ (6) $= \arg\max_w \sum_{n=1}^N \left[ w^T \phi(x_n, y_n) - \log Z(x_n, w) \right]$ (7). ML learning is intractable in general because of the partition function evaluation.
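A hedged sketch of the maximum-likelihood objective in Eq. (7), evaluated by brute force. The gradient expression (model expectation of $\phi$ minus the empirical $\phi$) is not on the slide; it is the standard exponential-family result, added here only for illustration, and `phi` is again the toy feature map.

```python
import numpy as np
from itertools import product

def phi(x, y, q):
    """Toy joint feature map (illustrative only)."""
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

def neg_log_likelihood(w, data, q):
    """sum_n [ log Z(x_n, w) - w^T phi(x_n, y_n) ]: the negation of Eq. (7)."""
    total = 0.0
    for x, y in data:
        scores = np.array([w @ phi(x, yy, q) for yy in product(range(q), repeat=len(x))])
        log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
        total += log_Z - w @ phi(x, y, q)
    return total

def nll_gradient(w, data, q):
    """Standard exponential-family gradient: sum_n ( E_{p(y|x_n,w)}[phi] - phi(x_n, y_n) )."""
    grad = np.zeros_like(w)
    for x, y in data:
        ys = list(product(range(q), repeat=len(x)))
        feats = np.array([phi(x, yy, q) for yy in ys])
        scores = feats @ w
        p = np.exp(scores - scores.max())
        p /= p.sum()
        grad += p @ feats - phi(x, y, q)
    return grad

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), (0, 1, 1)), (rng.normal(size=3), (1, 0, 1))]
w = np.zeros(3 * 2)
print(neg_log_likelihood(w, data, q=2), nll_gradient(w, data, q=2))
```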

CRFs. CRFs induce a factorization of the probability $p(y_n \mid x_n, w)$, which can make the partition function computation tractable: $p(y_n \mid x_n, w) \propto \exp\left( \sum_{t=1}^T w_t^T \phi_t(x_n, y_{nt}) \right)$ (8). The index $t$ ranges over the cliques in the above equation.
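To make Eq. (8) concrete, here is a sketch for a linear-chain factorization: node cliques score each position's label and edge cliques score adjacent label pairs, so the total score is a sum of per-clique terms $w_t^T \phi_t$. The particular emission/transition parameterization is an illustrative assumption, not the slides' model.

```python
import numpy as np

def chain_score(W_emit, W_trans, x, y):
    """
    Factored score sum_t w_t^T phi_t(x, y_t) for a linear chain (cf. Eq. 8):
    node cliques contribute W_emit[y_t] . x_t, edge cliques contribute W_trans[y_{t-1}, y_t].
    The feature choices here are illustrative, not from the slides.
    """
    score = sum(W_emit[y[t]] @ x[t] for t in range(len(y)))            # node cliques
    score += sum(W_trans[y[t - 1], y[t]] for t in range(1, len(y)))    # edge cliques
    return score

q, L, d = 3, 4, 2
rng = np.random.default_rng(0)
W_emit, W_trans = rng.normal(size=(q, d)), rng.normal(size=(q, q))
x = rng.normal(size=(L, d))
print(chain_score(W_emit, W_trans, x, (0, 2, 1, 1)))
```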

ML. For certain factorizations (like chains or trees), the partition function can be computed efficiently and exactly (using the sum-product algorithm), so ML learning is feasible for such CRFs. Recalling the ML maximization criterion: $w^* = \arg\max_w \sum_{n=1}^N \left[ w^T \phi(x_n, y_n) - \log Z(x_n, w) \right]$ (9) $= \arg\min_w \sum_{n=1}^N \left[ \log Z(x_n, w) - w^T \phi(x_n, y_n) \right]$ (10), where $w^T \phi(x_n, y_n) = \sum_{t=1}^T w_t^T \phi_t(x_n, y_{nt})$ (11) and $Z(x_n, w) = \sum_{y_{n1}} \sum_{y_{n2}} \cdots \sum_{y_{nL}} \exp\left( \sum_{t=1}^T w_t^T \phi_t(x_n, y_{nt}) \right)$ (12).
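For a chain, the nested sum in Eq. (12) can be computed exactly with the sum-product (forward) recursion in $O(L q^2)$ time instead of enumerating $q^L$ labelings. A sketch under the same illustrative emission/transition parameterization as above, with a brute-force sanity check on a tiny chain:

```python
import numpy as np
from itertools import product

def logsumexp(a, axis=None):
    """Numerically stable log-sum-exp helper."""
    a = np.asarray(a, dtype=float)
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else float(out)

def chain_log_partition(W_emit, W_trans, x):
    """
    Forward (sum-product) recursion for log Z(x, w) on a linear chain:
      alpha_1(y) = W_emit[y] . x_1
      alpha_t(y) = W_emit[y] . x_t + logsumexp_{y'} ( alpha_{t-1}(y') + W_trans[y', y] )
      log Z     = logsumexp_y alpha_L(y)
    """
    alpha = W_emit @ x[0]
    for t in range(1, len(x)):
        alpha = W_emit @ x[t] + logsumexp(alpha[:, None] + W_trans, axis=0)
    return logsumexp(alpha)

# Sanity check against brute-force enumeration (illustrative toy parameters).
q, L, d = 3, 4, 2
rng = np.random.default_rng(0)
W_emit, W_trans = rng.normal(size=(q, d)), rng.normal(size=(q, q))
x = rng.normal(size=(L, d))

def chain_score(y):
    s = sum(W_emit[y[t]] @ x[t] for t in range(L))
    return s + sum(W_trans[y[t - 1], y[t]] for t in range(1, L))

brute = logsumexp([chain_score(y) for y in product(range(q), repeat=L)])
print(np.allclose(chain_log_partition(W_emit, W_trans, x), brute))   # expected: True
```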

MAP learning in CRFs. MAP learning just adds a prior to the ML learning and is written as: $w^* = \arg\min_w \left\{ \|w\|^2 + C \sum_{n=1}^N \left[ \log Z(x_n, w) - w^T \phi(x_n, y_n) \right] \right\}$ (13). MAP learning can be interpreted as Regularized Risk Minimization: $w^* = \arg\min_w \left\{ \|w\|^2 + C \sum_{n=1}^N L(x_n, y_n, w) \right\}$ (14), with $L(x_n, y_n, w) = \log Z(x_n, w) - w^T \phi(x_n, y_n)$ (15).
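A sketch of the MAP / regularized-risk objective of Eqs. (13)-(15): the per-example loss $L(x_n, y_n, w) = \log Z(x_n, w) - w^T \phi(x_n, y_n)$ plus an $\|w\|^2$ penalty. The feature map and the trade-off constant `C` are placeholders; the partition function is again computed by brute force, which a real implementation would replace with sum-product.

```python
import numpy as np
from itertools import product

def phi(x, y, q):
    """Toy joint feature map (illustrative only)."""
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

def crf_loss(w, x, y, q):
    """L(x, y, w) = log Z(x, w) - w^T phi(x, y)   (Eq. 15), by enumeration."""
    scores = np.array([w @ phi(x, yy, q) for yy in product(range(q), repeat=len(x))])
    log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return log_Z - w @ phi(x, y, q)

def map_objective(w, data, q, C=1.0):
    """||w||^2 + C * sum_n L(x_n, y_n, w)   (Eqs. 13-14)."""
    return w @ w + C * sum(crf_loss(w, x, y, q) for x, y in data)

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), (0, 1, 1)), (rng.normal(size=3), (1, 0, 1))]
print(map_objective(np.zeros(6), data, q=2))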

Kernel CRFs. The above formulation of ML/MAP learning can be kernelized using the representer theorem; refer to (12.9) and (12.10) in Chapter 12 of the book (Predicting Structured Data). The CRF factorization helps keep the kernelized formulation tractable.

Learning CRFs: SVMs. Learning using ML estimation is one way of learning CRFs. ML estimation can be thought of as a specific loss function over which the Empirical Risk is being minimized; other loss functions give different optimization algorithms. Recalling the rescaled-margin SVM formulation: $\min_w \left\{ \frac{1}{2} \|w\|^2 + C \sum_{n=1}^N \left[ \max_{y \neq y_n} \left( \Delta(y_n, y) + F(x_n, y) - F(x_n, y_n) \right) \right]_+ \right\}$ (16), where $F(x_n, y_n) = \sum_{t=1}^T w_t^T \phi_t(x_n, y_{nt})$ (17).
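To illustrate the loss term in Eq. (16): for each example, the margin-rescaled hinge loss is obtained from a loss-augmented argmax over $y$. A brute-force sketch, using Hamming distance for $\Delta(y_n, y)$ as an illustrative choice (the slides do not fix a particular $\Delta$), and the same toy `phi` as before:

```python
import numpy as np
from itertools import product

def hamming(y_true, y):
    """Illustrative choice of task loss Delta(y_n, y): Hamming distance."""
    return sum(a != b for a, b in zip(y_true, y))

def structured_hinge(F, x, y_true, q, L):
    """
    Margin-rescaled hinge loss for one example (cf. Eq. 16):
      [ max_{y != y_n} ( Delta(y_n, y) + F(x, y) ) - F(x, y_n) ]_+
    computed here by brute-force loss-augmented inference.
    """
    best = max(
        (hamming(y_true, y) + F(x, y) for y in product(range(q), repeat=L) if y != y_true),
        default=-np.inf,
    )
    return max(0.0, best - F(x, y_true))

# Usage with a toy linear score F(x, y) = w^T phi(x, y).
def phi(x, y, q):
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

rng = np.random.default_rng(0)
x, w, q = rng.normal(size=3), rng.normal(size=6), 2
print(structured_hinge(lambda xx, yy: w @ phi(xx, yy, q), x, (0, 1, 1), q, L=3))
```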

Learning CRFs: SVMs (cont.). The rescaled-margin SVM formulation can also be written as: $\min_{w, \xi_n} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{n=1}^N \xi_n \right\}$ (18) subject to $F(x_n, y_n) - F(x_n, y) \geq \Delta(y_n, y) - \xi_n \quad \forall y, \forall n$ (19). This formulation has exponentially many constraints, but can be solved in polynomial space using the cutting-plane algorithm discussed last time. That algorithm does not take any advantage of the factorization provided by the CRF; (12.13)-(12.16) of the book provide an alternative way of optimizing (18) which uses the CRF factorization to achieve tractability.
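A hedged sketch of the cutting-plane idea referred to on this slide: maintain a working set of most-violated constraints from (19), re-fit $w$ on that restricted set, and stop when no constraint is violated by more than a tolerance. Everything here is an assumption for illustration: the separation oracle is brute force, $\Delta$ is Hamming distance, the inner solver is crude subgradient descent rather than an exact QP solver, and `C`, `lr`, `eps` are placeholder values.

```python
import numpy as np
from itertools import product

def phi(x, y, q):
    """Toy joint feature map (illustrative only)."""
    f = np.zeros(len(x) * q)
    for l, (xl, yl) in enumerate(zip(x, y)):
        f[l * q + yl] = xl
    return f

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def most_violated(w, x, y_n, q):
    """Separation oracle: argmax_y Delta(y_n, y) + w^T phi(x, y)  (brute force)."""
    return max(product(range(q), repeat=len(x)),
               key=lambda y: hamming(y_n, y) + w @ phi(x, y, q))

def cutting_plane(data, q, C=1.0, eps=1e-3, inner_steps=200, lr=0.01, max_iter=20):
    """Working-set loop: add most-violated constraints, re-fit w, repeat until converged."""
    w = np.zeros(len(phi(*data[0], q)))
    working = [set() for _ in data]                      # constraint set per example
    for _ in range(max_iter):
        added = False
        for n, (x, y_n) in enumerate(data):
            y_hat = most_violated(w, x, y_n, q)
            slack = hamming(y_n, y_hat) + w @ (phi(x, y_hat, q) - phi(x, y_n, q))
            current = max((hamming(y_n, y) + w @ (phi(x, y, q) - phi(x, y_n, q))
                           for y in working[n]), default=0.0)
            if slack > max(current, 0.0) + eps:
                working[n].add(y_hat)
                added = True
        if not added:
            break
        # Crude inner solver: subgradient descent on the restricted objective
        # (a real implementation would solve this QP exactly).
        for _ in range(inner_steps):
            g = w.copy()                                  # gradient of (1/2)||w||^2
            for (x, y_n), cons in zip(data, working):
                viol = [(hamming(y_n, y) + w @ (phi(x, y, q) - phi(x, y_n, q)), y) for y in cons]
                if viol:
                    v, y_star = max(viol)
                    if v > 0:
                        g += C * (phi(x, y_star, q) - phi(x, y_n, q))
            w -= lr * g
    return w

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), (0, 1, 1)), (rng.normal(size=3), (1, 0, 1))]
print(cutting_plane(data, q=2))
```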

Learning CRFs: CGMs. Conditional Graphical Models (CGMs) are an alternative way of training CRFs. They make use of the CRF factorization to upper-bound the Empirical Risk by a sum of functions that decomposes per clique.