Conditional Random Fields and Beyond. Daniel Khashabi, CS 546, UIUC, 2013.


Outline: Modeling; Inference; Training; Applications

Outline: Modeling (Problem definition, Discriminative vs. Generative, Chain CRF, General CRF); Inference; Training; Applications

Problem Description. Given X (observations), find Y (predictions). For example, X = {temperature, moisture, pressure, ...} and Y = {Sunny, Rainy, Stormy, ...}. The Y's might depend on previous days and on each other.

Problem Description. Such relational structure occurs in many applications: NLP, computer vision, signal processing, etc. Traditionally, graphical models model the joint distribution p(x, y) = p(y | x) p(x). Modeling the joint distribution can lead to difficulties: relational data has rich local features, and those features may have complex dependencies, so constructing a probability distribution p(x) over them is difficult. Solution: directly model the conditional, since p(y | x) is sufficient for classification. A CRF is simply a conditional distribution p(y | x) with an associated graphical structure.

Discriminative vs. Generative. Generative model: models the joint p(y, x), i.e., a model that can generate the observed data randomly. Naive Bayes: once the class label is known, all the features are independent. Discriminative model: directly estimates the posterior p(y | x); aims at modeling the discrimination between different outputs. MaxEnt classifier: a linear combination of feature functions in the exponent. Both generative and discriminative models describe distributions over (y, x), but they work in different directions.

Discriminative vs. Generative. [Figure: graphical models for the joint p(y, x) and the conditional p(y | x); shaded nodes are observable, unshaded nodes are unobservable.]
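To make the two directions concrete, here is a minimal sketch (toy 1-D data and a hand-rolled logistic fit of my own, not from the slides): the generative model estimates p(x | y) p(y) and inverts it with Bayes' rule, while the discriminative model fits p(y | x) directly.

```python
# Generative vs. discriminative on a toy 1-D, two-class problem.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# Generative: fit class priors and class-conditional Gaussians, then invert with Bayes' rule.
def generative_posterior(x_new):
    post = []
    for c in (0, 1):
        prior = np.mean(y == c)
        mu, sigma = x[y == c].mean(), x[y == c].std()
        lik = np.exp(-0.5 * ((x_new - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        post.append(prior * lik)
    post = np.array(post)
    return post[1] / post.sum(axis=0)      # p(y = 1 | x)

# Discriminative: fit p(y = 1 | x) = sigmoid(w*x + b) by gradient ascent on the conditional log-likelihood.
w, b = 0.0, 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w += 0.1 * np.mean((y - p) * x)
    b += 0.1 * np.mean(y - p)

print(generative_posterior(0.5), 1.0 / (1.0 + np.exp(-(w * 0.5 + b))))
```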

Markov Random Fields (MRF) and Factor Graphs. On an undirected graph, the joint distribution of the variables y factorizes over cliques C: p(y) = (1/Z) ∏_C ψ_C(y_C), with Z = Σ_y ∏_C ψ_C(y_C). Each factor ψ_C(y_C) ≥ 0 is a potential function over the variables of clique C, typically ψ_C(y_C) = exp{−E(y_C)}; Z is the partition function. Not all distributions satisfy the Markov properties; by the Hammersley-Clifford theorem, the ones that do can be factorized in this way.
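A minimal sketch (my own toy example, not from the slides) of this factorization: a 3-node binary MRF on a chain y1 - y2 - y3 with pairwise potentials ψ(a, b) = exp(w·1[a = b]), illustrating p(y) = (1/Z) ∏_C ψ_C(y_C) and the partition function Z.

```python
import itertools
import numpy as np

w = 1.5                                   # agreement strength (assumed value)
psi = lambda a, b: np.exp(w * (a == b))   # clique potential on an edge

def unnormalized(y):                      # product over the two edge cliques
    return psi(y[0], y[1]) * psi(y[1], y[2])

states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(y) for y in states)  # partition function: sum over all 2^3 configurations

for y in states:
    print(y, unnormalized(y) / Z)         # a proper distribution that favors agreeing neighbors
```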

Directed Graphical Models (Bayesian Networks). The joint distribution factorizes into local conditional distributions, p(y) = ∏_s p(y_s | y_π(s)), where π(s) are the indices of the parents of y_s. Generally used as generative models. E.g., Naive Bayes: once the class label is known, all the features are independent.

Sequence prediction. For example NER: identifying and classifying proper names in text, e.g., "China" as a location, "George Bush" as a person, "United Nations" as an organization. We have a set of observations x and an underlying sequence of states y. The HMM is generative, with a transition probability p(y_t | y_{t-1}) and an observation probability p(x_t | y_t). It doesn't model long-range dependencies, and it is not practical to represent multiple interacting features (it is hard to model p(x)). The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions, and a CRF can handle overlapping features.
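As a sketch of the generative factorization just described (toy transition and emission numbers of my own, not from the slides), the HMM joint probability p(x, y) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t) can be computed directly:

```python
import numpy as np

states = ["PER", "O"]
vocab = ["george", "bush", "said"]
pi = np.array([0.5, 0.5])                 # initial state distribution
A = np.array([[0.6, 0.4],                 # A[i, j] = p(y_t = j | y_{t-1} = i)
              [0.3, 0.7]])
B = np.array([[0.45, 0.45, 0.10],         # B[i, k] = p(x_t = vocab[k] | y_t = states[i])
              [0.05, 0.05, 0.90]])

def joint_prob(x_ids, y_ids):
    p = pi[y_ids[0]] * B[y_ids[0], x_ids[0]]
    for t in range(1, len(x_ids)):
        p *= A[y_ids[t - 1], y_ids[t]] * B[y_ids[t], x_ids[t]]
    return p

# p(x = "george bush said", y = PER PER O)
print(joint_prob([0, 1, 2], [0, 0, 1]))
```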

Chain CRFs. Each potential function operates on a pair of adjacent label variables (y_{t-1}, y_t): p(y | x) = (1/Z(x)) exp{ Σ_t Σ_i λ_i f_i(y_{t-1}, y_t, x, t) }, with parameters λ_i to be estimated and feature functions f_i. [Figure: linear-chain CRF; the labels y are unobservable, the inputs x are observable.]

Chain CRF. We can change the structure so that each state depends on more observations: only the current input, or also inputs at previous steps, or all inputs. [Figure: chain CRF variants; the labels are unobservable, the inputs are observable.]
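A minimal sketch of the linear-chain CRF score, with hypothetical indicator features and hand-picked weights (my own example, not from the slides); the normalizer Z(x) is computed by brute force, which is only feasible for toy inputs.

```python
import itertools
import math

LABELS = ["NOUN", "VERB", "ART"]

def features(y_prev, y_curr, x, t):
    # Binary feature functions f_i(y_{t-1}, y_t, x, t), named by string keys.
    feats = {f"trans:{y_prev}->{y_curr}": 1.0, f"word:{y_curr}:{x[t]}": 1.0}
    if x[t][0].isupper():
        feats[f"cap:{y_curr}"] = 1.0
    return feats

def score(y, x, weights):
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<START>"
        for name, value in features(y_prev, y[t], x, t).items():
            total += weights.get(name, 0.0) * value
    return total

def prob(y, x, weights):
    num = math.exp(score(y, x, weights))
    Z = sum(math.exp(score(cand, x, weights))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return num / Z

weights = {"word:NOUN:Students": 2.0, "trans:NOUN->VERB": 1.0, "word:VERB:need": 1.5}
x = ["Students", "need", "another", "break"]
print(prob(("NOUN", "VERB", "ART", "NOUN"), x, weights))
```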

General CRF: visualization. Definition (Lafferty et al., 2001): (X, Y) is a CRF if, when conditioned on X, the variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that Y_w and Y_v are neighbors. The MRF is over Y; in the CRF, X is a set of fixed, observable variables that are not part of the MRF. Note that in a CRF we do not explicitly model any direct relationships between the observables (i.e., among the X) (Lafferty et al., 2001); Hammersley-Clifford does not apply to X.

General CRF: visualization. Divide the MRF over Y into cliques; the CRF cliques include only the unobservables Y, while the observables X are not included in the cliques. The parameters inside each clique template are tied. Φ_c(y_c, x) are the potential functions for template c. Then p(y | x) = (1/Z(x)) e^{Q(y, x)}, where Q(y, x) = Σ_{c∈C} Φ_c(y_c, x) and Z(x) = Σ_y e^{Q(y, x)}. Note that we are not summing over x in the denominator. The cliques contain only unobservables (y), though x is an argument to Φ_c. The probability p_M(y | x) is a joint distribution over the unobservables Y.

General CRF: visualization. A number of ad hoc modeling decisions are typically made with regard to the form of the potential functions. Φ_c is typically decomposed into a weighted sum of feature functions (sensors) f_i, producing p(y | x) = (1/Z(x)) e^{Q(y, x)} with Q(y, x) = Σ_{c∈C} Φ_c(y_c, x) = Σ_{c∈C} Σ_{i∈F} λ_i f_i(y_c, x). Back to the chain CRF: cliques can be identified as pairs of adjacent Y's, so p(y | x) = (1/Z(x)) exp{ Σ_{c∈C} Σ_{i∈F} λ_i f_i(y_c, x) }.

Chain CRFs vs. MEMM. Linear-chain CRFs were originally introduced as an improvement over Maximum Entropy Markov Models (MEMMs). In an MEMM, the transition probabilities are given by logistic regression. Notice the per-state normalization: each local distribution depends only on the previous state and the current input, with no dependence on future observations or states. This causes the label-bias problem.
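A minimal sketch (hypothetical scores of my own) of why per-state normalization causes label bias: states with few outgoing options pass their probability mass on almost unchanged, regardless of the observation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# MEMM: transition scores out of each previous state are normalized locally.
scores_from_state_A = np.array([3.0, 0.1])   # two possible next labels: informative distribution
scores_from_state_B = np.array([0.2])        # only one possible next label
print(softmax(scores_from_state_A))
print(softmax(scores_from_state_B))          # always [1.0]; the observation is effectively ignored
```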

CRFs vs. MEMM vs. HMM. [Figure: graphical models of the HMM, the MEMM, and the CRF side by side.]

Outline: Modeling; Inference (General CRF, Chain CRF); Training; Applications

Inference. Given the observations {x_i} and the parameters, we want to find the best state sequence. For the general CRF: y* = argmax_y p(y | x) = argmax_y (1/Z(x)) e^{Σ_{c∈C} Φ_c(y_c, x)} = argmax_y Σ_{c∈C} Φ_c(y_c, x). For general graphs, the problem of exact inference in CRFs is intractable, so approximate methods are used; there is a large literature.

Inference in HMM. Dynamic programming: Forward, Backward, Viterbi. [Figure: trellis over states 1..K and observations x_1, x_2, x_3, ..., x_K.]
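A minimal sketch of Viterbi decoding over this trellis (standard textbook recursion in log space, reusing the toy HMM parameters from the sketch above; not code from the slides):

```python
import numpy as np

def viterbi(x_ids, pi, A, B):
    T, K = len(x_ids), len(pi)
    delta = np.zeros((T, K))               # best log-score of any path ending in state k at time t
    back = np.zeros((T, K), dtype=int)     # backpointers
    delta[0] = np.log(pi) + np.log(B[:, x_ids[0]])
    for t in range(1, T):
        for k in range(K):
            cand = delta[t - 1] + np.log(A[:, k])
            back[t, k] = np.argmax(cand)
            delta[t, k] = cand[back[t, k]] + np.log(B[k, x_ids[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):          # follow backpointers from the end
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.45, 0.45, 0.10], [0.05, 0.05, 0.90]])
print(viterbi([0, 1, 2], pi, A, B))        # most likely state sequence
```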

Parameter Learning: Chain CRF. For the chain CRF this can be done using dynamic programming. Assume y_t ∈ Y. Naively summing over all |Y|^n label sequences would be intractable; instead, define for each position a matrix of size |Y| × |Y| holding the exponentiated clique scores, so that Z(x) and the marginals can be computed by matrix products.

Parameter Learning: Chain CRF. From these per-position matrices, define forward variables α_t(y) and backward variables β_t(y), analogous to the HMM forward-backward recursions.

Inference: Chain CRF. Inference in a linear-chain CRF is very similar to that in an HMM. We can write the marginal distributions in terms of α and β and solve the chain CRF using dynamic programming (similar to Viterbi): 1. First compute α_t for all t (forward), then compute β_t for all t (backward). 2. Return the marginal distributions computed from them. 3. Run Viterbi to find the optimal sequence. The cost is O(n·|Y|²).
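A minimal sketch of the forward-backward pass in log space (my own notation: per-position score matrices M_t[a, b] hold the summed weighted features for the transition a → b; not code from the slides). It returns log Z(x) and the pairwise marginals p(y_{t-1} = a, y_t = b | x).

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_M, log_unary0):
    # log_M: array (T-1, K, K) of transition+emission log-scores for t = 1..T-1
    # log_unary0: array (K,) of log-scores for the first position
    T = log_M.shape[0] + 1
    K = log_unary0.shape[0]
    log_alpha = np.zeros((T, K))
    log_beta = np.zeros((T, K))
    log_alpha[0] = log_unary0
    for t in range(1, T):                  # forward recursion
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_M[t - 1], axis=0)
    for t in range(T - 2, -1, -1):         # backward recursion
        log_beta[t] = logsumexp(log_M[t] + log_beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(log_alpha[-1])
    pair_marg = [np.exp(log_alpha[t][:, None] + log_M[t] + log_beta[t + 1][None, :] - log_Z)
                 for t in range(T - 1)]
    return log_Z, pair_marg

# Toy example: 3 positions, 2 labels, arbitrary scores.
rng = np.random.default_rng(0)
log_Z, marg = forward_backward(rng.normal(size=(2, 2, 2)), rng.normal(size=2))
print(log_Z, marg[0].sum())                # each pairwise marginal sums to 1
```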

Outline: Modeling; Inference; Training (General CRF, Some notes on approximate learning); Applications

Parameter Learning. Given the training data, we wish to learn the parameters of the model. Chain- or tree-structured CRFs can be trained by maximum likelihood; the objective function for the chain CRF is convex (see Lafferty et al., 2001). General CRFs are intractable, hence approximate solutions are necessary.

Parameter Learning. Given the training data, we wish to learn the parameters of the model. The conditional log-likelihood for a general CRF is the expectation of log p(y | x; λ) under the empirical distribution of the training data; the partition function inside it is hard to calculate. It is (almost always) not possible to determine analytically the parameter values that maximize the log-likelihood: setting the gradient to zero and solving for λ does not yield a closed-form solution.

Parameter Learning. This can be done by gradient ascent on the conditional log-likelihood, λ* = argmax_λ L(λ; y | x) = argmax_λ log p(y | x; λ): repeatedly update λ_i ← λ_i + α·∂L/∂λ_i until convergence, i.e., until |L(λ^(k+1); y | x) − L(λ^(k); y | x)| < ε. Any other optimizer can be used as well, e.g., quasi-Newton methods: BFGS [Bertsekas, 1999] or L-BFGS [Byrd, 1994]. General CRFs are intractable, hence approximate solutions are necessary. Compared with Markov chains, CRFs should be more discriminative, but they are much slower to train and possibly more susceptible to over-training. Regularization: f_objective(θ) = log P_θ(y | x) − Σ_{i=1}^{N} λ_i² / (2σ²), where σ is a regularization parameter.
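A minimal sketch (a two-feature toy chain CRF of my own, with brute-force model expectations) of the gradient ascent just described: each step moves λ along E_empirical[f] − E_model[f] − λ/σ².

```python
import itertools
import numpy as np

LABELS = [0, 1]

def global_features(y, x):
    # f(y, x): one transition feature (count of 0->1) and one emission feature (label 1 on word "need").
    f_trans = sum(1.0 for t in range(1, len(y)) if (y[t - 1], y[t]) == (0, 1))
    f_emit = sum(1.0 for t in range(len(y)) if y[t] == 1 and x[t] == "need")
    return np.array([f_trans, f_emit])

def grad_step(x, y_gold, lam, lr=0.1, sigma2=10.0):
    seqs = list(itertools.product(LABELS, repeat=len(x)))
    feats = np.array([global_features(y, x) for y in seqs])
    probs = np.exp(feats @ lam)
    probs /= probs.sum()                                   # p(y | x; lambda) by brute-force enumeration
    grad = global_features(y_gold, x) - probs @ feats - lam / sigma2
    return lam + lr * grad

lam = np.zeros(2)
x = ("students", "need", "a", "break")
y_gold = (0, 1, 0, 0)
for _ in range(50):
    lam = grad_step(x, y_gold, lam)
print(lam)                                                 # weights move toward the gold analysis
```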

Training (and Inference): General Case. Approximate solutions, to get faster inference. Treat inference as a shortest-path problem in a network of paths with costs, e.g., Max-Flow / Min-Cut (Ford-Fulkerson, 1956). Pseudo-likelihood approximation: convert the CRF into separate patches, each consisting of a hidden node and the true values of its neighbors, and run maximum likelihood on the separate patches; efficient, but may over-estimate inter-dependencies. Belief propagation: a variational inference algorithm; it is a direct generalization of the exact inference algorithms for linear-chain CRFs. Sampling-based methods (MCMC).
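A minimal sketch of the pseudo-likelihood idea (an Ising-style grid MRF with a single tied weight; all numbers are my own, not from the slides): each term conditions one node on the true values of its neighbors, so only local, one-node normalization is needed.

```python
import numpy as np

w = np.array(0.5)                          # single tied edge weight (assumed)
y = np.array([[1, 1, 0],
              [1, 0, 0]])                  # observed binary labeling of a 2x3 grid

def neighbors(i, j, shape):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < shape[0] and 0 <= nj < shape[1]:
            yield ni, nj

def log_pseudo_likelihood(w, y):
    total = 0.0
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            # conditional of node (i, j) given its neighbors: local logistic form
            s = sum(w * (1 if y[ni, nj] == 1 else -1) for ni, nj in neighbors(i, j, y.shape))
            logits = np.array([-s, s])     # scores for y_ij = 0 and y_ij = 1
            total += logits[y[i, j]] - np.log(np.exp(logits).sum())
    return total

print(log_pseudo_likelihood(w, y))
```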

CRF frontiers. Bayesian CRF: because of the large number of parameters in typical applications, CRFs are prone to overfitting. Regularization helps; alternatively, instead of a point estimate one can integrate over the parameters, but the resulting posterior is too complicated to compute exactly: how can we approximate it? Semi-supervised CRF: CRFs need large labeled data sets. Unlike in generative models, it is less obvious how to incorporate unlabeled data into a conditional criterion, because the unlabeled data is a sample from the distribution p(x).

Outline: Modeling; Inference; Training; Some Applications

Some applications: Part-of-Speech Tagging. POS (part-of-speech) tagging: the identification of words as nouns, verbs, adjectives, adverbs, etc. Example: "Students need another break" → noun verb article noun. CRF features:

Feature type              Indexed by   Description
Transition                k, k'        y_i = k and y_{i+1} = k'
Word                      k, w         y_i = k and x_i = w
Word                      k, w         y_i = k and x_{i-1} = w
Word                      k, w         y_i = k and x_{i+1} = w
Word                      k, w, w'     y_i = k and x_i = w and x_{i-1} = w'
Word                      k, w, w'     y_i = k and x_i = w and x_{i+1} = w'
Orthography: Suffix       s, k         s in {ing, ed, ogy, s, ly, ion, tion, ity, ...}, y_i = k and x_i ends with s
Orthography: Punctuation  k            y_i = k and x_i is capitalized
Orthography: Punctuation  k            y_i = k and x_i is hyphenated
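A minimal sketch (my own helper, mirroring the feature table above; not code from the slides) that turns one position of a tagged sentence into the kinds of indicator features listed: transition, word/context-word, suffix, and capitalization features, encoded as strings.

```python
def pos_features(words, tags, i):
    feats = [f"trans:{tags[i]}->{tags[i + 1]}"] if i + 1 < len(tags) else []
    feats.append(f"word:{tags[i]}:{words[i]}")
    if i > 0:
        feats.append(f"prev_word:{tags[i]}:{words[i - 1]}")
    if i + 1 < len(words):
        feats.append(f"next_word:{tags[i]}:{words[i + 1]}")
    for suffix in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if words[i].endswith(suffix):
            feats.append(f"suffix:{tags[i]}:{suffix}")
    if words[i][0].isupper():
        feats.append(f"capitalized:{tags[i]}")
    if "-" in words[i]:
        feats.append(f"hyphenated:{tags[i]}")
    return feats

print(pos_features(["Students", "need", "another", "break"],
                   ["noun", "verb", "article", "noun"], 0))
```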

Is HMM (generative) better, or CRF (discriminative)? If your application gives you good structural information that can easily be modeled by dependent distributions and learned tractably, go the generative way. Examples: higher-order emissions from individual states [Figure: HMM with unobservable states emitting the observable bases A A T C G]; incorporating evolutionary conservation from an alignment, e.g., the PhyloHMM, for which efficient decoding methods exist [Figure: states over a target genome and informant genomes].

References
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
Charles Elkan. Log-linear models and conditional random fields. Notes for a tutorial at CIKM, 2008.
Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, MIT Press, 2006.
Charles Sutton and Andrew McCallum. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088, 2010.
Ching-Chun Hsiao. An introduction to conditional random fields. Slides.
Hanna M. Wallach. Conditional random fields: An introduction. 2004.
B. Majoros. Conditional random fields, for eukaryotic gene prediction. Slides.