Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015


Announcement
A2 is out. Due Oct 20 at 1pm.

Outline
- Hidden Markov models: shortcomings
- Generative vs. discriminative models
- Linear-chain CRFs
- Inference and learning algorithms with linear-chain CRFs

Hidden Markov Model
The graph specifies how the joint probability decomposes.
[Figure: chain of hidden states Q_1 … Q_5, each emitting an observation O_1 … O_5]
P(O, Q) = P(Q_1) ∏_{t=1}^{T−1} P(Q_{t+1} | Q_t) ∏_{t=1}^{T} P(O_t | Q_t)
where P(Q_1) is the initial state probability, P(Q_{t+1} | Q_t) are the state transition probabilities, and P(O_t | Q_t) are the emission probabilities.
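As a concrete illustration, here is a minimal sketch of this factorization in Python, assuming a toy two-state HMM; the parameters π, A, B and the example sequences are made up for illustration:

```python
import numpy as np

# Toy parameters (illustrative only)
pi = np.array([0.6, 0.4])            # pi[i]   = P(Q_1 = i)
A  = np.array([[0.7, 0.3],           # A[i, j] = P(Q_{t+1} = j | Q_t = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],           # B[i, o] = P(O_t = o | Q_t = i)
               [0.1, 0.9]])

def joint_prob(obs, states):
    """P(O, Q) = P(Q_1) * prod_t P(Q_{t+1} | Q_t) * prod_t P(O_t | Q_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob(obs=[0, 1, 1], states=[0, 0, 1]))
```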

Shortcomings of Standard HMMs
How do we add more features to HMMs? Features that might be useful for POS tagging:
- Word position within the sentence (1st, 2nd, last, …)
- Capitalization
- Word prefixes and suffixes (-tion, -ed, -ly, -er, re-, de-)
- Features that depend on more than the current word or the previous words

Possible to Do with HMMs
Add more emissions at each timestep.
[Figure: a single state Q_1 with four emissions O_11 … O_14: word identity, capitalization, prefix feature, word position]
Clunky. Is there a better way to do this?

Discriminative Models
HMMs are generative models: they build a model of the joint probability distribution P(O, Q). Let's rename the variables: generative models specify P(X, Y; θ_gen).
If we are only interested in POS tagging, we can instead train a discriminative model, which specifies P(Y | X; θ_disc). This is now a task-specific model for sequence labelling; we cannot use it for generating new samples of word and POS sequences.

Generative or Discriminative?
Naive Bayes: P(y | x) = P(y) ∏_i P(x_i | y) / P(x)
Logistic regression: P(y | x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + … + a_n x_n + b)
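For the discriminative side, a small sketch of the binary logistic regression case, where the normalizer Z reduces to 1 + exp(a·x + b) and the probability becomes a sigmoid (weights and inputs below are made up):

```python
import numpy as np

def p_y1_given_x(x, a, b):
    # P(y = 1 | x) = exp(a.x + b) / (1 + exp(a.x + b)) = sigmoid(a.x + b)
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + b)))

# Illustrative weights and feature vector
print(p_y1_given_x(x=np.array([1.0, 2.0]), a=np.array([0.5, -0.3]), b=0.1))
```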

Discriminative Sequence Model
The parallel to an HMM in the discriminative case: linear-chain conditional random fields (linear-chain CRFs) (Lafferty et al., 2001)
P(Y | X) = (1 / Z(X)) exp( Σ_t Σ_k θ_k f_k(y_t, y_{t−1}, x_t) )
The outer sum runs over all time steps and the inner sum over all features. Z(X) is a normalization constant, summing over all possible sequences of hidden states:
Z(X) = Σ_Y exp( Σ_t Σ_k θ_k f_k(y_t, y_{t−1}, x_t) )
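A minimal sketch of this definition, assuming a toy two-label tag set and a hand-rolled feature function; Z(X) is computed here by brute-force enumeration of all label sequences, which is only feasible for tiny examples (the efficient alternative is the forward algorithm shown later):

```python
import itertools
import math

LABELS = ["DT", "NN"]

def features(y_t, y_prev, x_t):
    # Active binary features at one position (toy feature set)
    return {("trans", y_prev, y_t): 1.0, ("emit", y_t, x_t): 1.0}

def score(ys, xs, theta):
    # sum_t sum_k theta_k * f_k(y_t, y_{t-1}, x_t), with y_0 = "START"
    total, prev = 0.0, "START"
    for y, x in zip(ys, xs):
        total += sum(theta.get(k, 0.0) * v for k, v in features(y, prev, x).items())
        prev = y
    return total

def prob(ys, xs, theta):
    Z = sum(math.exp(score(cand, xs, theta))
            for cand in itertools.product(LABELS, repeat=len(xs)))
    return math.exp(score(ys, xs, theta)) / Z

theta = {("trans", "START", "DT"): 1.0, ("trans", "DT", "NN"): 1.5,
         ("emit", "DT", "the"): 2.0, ("emit", "NN", "dog"): 2.0}
print(prob(["DT", "NN"], ["the", "dog"], theta))
```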

Intuition
Standard HMM: a product of probabilities; these probabilities are defined over the identity of the states and words.
Transition from state DT to NN: P(y_{t+1} = NN | y_t = DT)
Emit word "the" from state DT: P(x_t = the | y_t = DT)
Linear-chain CRF: replace the products by numbers that are NOT probabilities, but linear combinations of weights and feature values.

Features in CRFs
Standard HMM probabilities as CRF features:
Transition from state DT to state NN: f_{DT→NN}(y_t, y_{t−1}, x_t) = 1(y_{t−1} = DT) · 1(y_t = NN)
Emit "the" from state DT: f_{DT,the}(y_t, y_{t−1}, x_t) = 1(y_t = DT) · 1(x_t = the)
Initial state is DT: f_{DT}(y_1, x_1) = 1(y_1 = DT)
Indicator function: 1(condition) = 1 if condition is true, 0 otherwise

Features in CRFs
Additional features that may be useful:
Word is capitalized: f_cap(y_t, y_{t−1}, x_t) = 1(y_t = ?) · 1(x_t is capitalized)
Word ends in -ed: f_ed(y_t, y_{t−1}, x_t) = 1(y_t = ?) · 1(x_t ends with -ed)
Exercise: propose more features (see the sketch below)
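A hedged sketch of how such indicator features might be coded up, combining the HMM-style features with the orthographic ones above; the feature names and the helper itself are illustrative, not a required format:

```python
def extract_features(y_t, y_prev, x_t):
    # Indicator features for one position; only active features are returned.
    feats = {
        f"trans={y_prev}->{y_t}": 1.0,                        # HMM-style transition
        f"emit={y_t}:{x_t.lower()}": 1.0,                     # HMM-style emission
        f"cap&tag={y_t}": 1.0 if x_t[:1].isupper() else 0.0,  # word is capitalized
        f"suffix=ed&tag={y_t}": 1.0 if x_t.endswith("ed") else 0.0,  # word ends in -ed
    }
    return {k: v for k, v in feats.items() if v != 0.0}

print(sorted(extract_features("VBD", "NNP", "Jumped")))
```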

Inference with LC-CRFs
Dynamic programming still works: modify the forward and the Viterbi algorithms to work with the weight-feature products.
                     HMM                       LC-CRF
Forward algorithm    P(X | θ)                  Z(X)
Viterbi algorithm    argmax_Y P(X, Y | θ)      argmax_Y P(Y | X, θ)

Forward Algorithm for HMMs
Create trellis α_i(t) for i = 1…N, t = 1…T
α_j(1) = π_j b_j(O_1) for j = 1…N
for t = 2…T:
    for j = 1…N:
        α_j(t) = [ Σ_{i=1}^{N} α_i(t−1) a_ij ] b_j(O_t)
P(O | θ) = Σ_{j=1}^{N} α_j(T)
Runtime: O(N²T)
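A runnable version of this pseudocode for the toy HMM parameters used earlier (π, A, B are the same illustrative numbers):

```python
import numpy as np

def forward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                          # alpha_j(1)
    for t in range(1, T):
        for j in range(N):
            alpha[j, t] = (alpha[:, t - 1] @ A[:, j]) * B[j, obs[t]]
    return alpha[:, -1].sum()                                # P(O | theta)

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward([0, 1, 1], pi, A, B))
```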

Forward Algorithm for LC-CRFs
Create trellis α_i(t) for i = 1…N, t = 1…T
α_j(1) = exp( Σ_k θ_k^init f_k^init(y_1 = j, x_1) ) for j = 1…N
for t = 2…T:
    for j = 1…N:
        α_j(t) = Σ_{i=1}^{N} α_i(t−1) exp( Σ_k θ_k f_k(y_t = j, y_{t−1} = i, x_t) )
Z(X) = Σ_{j=1}^{N} α_j(T)
Runtime: O(N²T)
Transition and emission probabilities are replaced by the exponent of weighted sums of features. Having Z(X) allows us to compute P(Y | X).
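A sketch of this recursion on the same toy CRF as before; it computes Z(X) in O(N²T) time rather than by enumerating label sequences, with a "START" previous label standing in for the θ^init features:

```python
import math

LABELS = ["DT", "NN"]

def psi(y_t, y_prev, x_t, theta):
    # exp( sum_k theta_k f_k(y_t, y_prev, x_t) ) with the toy feature set
    feats = {("trans", y_prev, y_t): 1.0, ("emit", y_t, x_t): 1.0}
    return math.exp(sum(theta.get(k, 0.0) * v for k, v in feats.items()))

def partition(xs, theta):
    alpha = {y: psi(y, "START", xs[0], theta) for y in LABELS}   # alpha_j(1)
    for x in xs[1:]:
        alpha = {y: sum(alpha[yp] * psi(y, yp, x, theta) for yp in LABELS)
                 for y in LABELS}
    return sum(alpha.values())                                   # Z(X)

theta = {("trans", "START", "DT"): 1.0, ("trans", "DT", "NN"): 1.5,
         ("emit", "DT", "the"): 2.0, ("emit", "NN", "dog"): 2.0}
print(partition(["the", "dog"], theta))
```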

Viterbi Algorithm for HMMs
Create trellis δ_i(t) for i = 1…N, t = 1…T
δ_j(1) = π_j b_j(O_1) for j = 1…N
for t = 2…T:
    for j = 1…N:
        δ_j(t) = max_i δ_i(t−1) a_ij b_j(O_t)
Take max_i δ_i(T)
Runtime: O(N²T)
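A runnable sketch for the same toy HMM, including the backpointers needed to recover the best state sequence (the next slide makes the same point for the CRF case):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    N, T = len(pi), len(obs)
    delta = np.zeros((N, T))
    back = np.zeros((N, T), dtype=int)
    delta[:, 0] = pi * B[:, obs[0]]                  # delta_j(1)
    for t in range(1, T):
        for j in range(N):
            scores = delta[:, t - 1] * A[:, j]
            back[j, t] = scores.argmax()             # backpointer to best previous state
            delta[j, t] = scores.max() * B[j, obs[t]]
    path = [int(delta[:, -1].argmax())]
    for t in range(T - 1, 0, -1):                    # follow backpointers
        path.append(int(back[path[-1], t]))
    return path[::-1], delta[:, -1].max()

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi([0, 1, 1], pi, A, B))
```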

Viterbi Algorithm for LC-CRFs
Create trellis δ_i(t) for i = 1…N, t = 1…T
δ_j(1) = exp( Σ_k θ_k^init f_k^init(y_1 = j, x_1) ) for j = 1…N
for t = 2…T:
    for j = 1…N:
        δ_j(t) = max_i δ_i(t−1) exp( Σ_k θ_k f_k(y_t = j, y_{t−1} = i, x_t) )
Take max_i δ_i(T)
Runtime: O(N²T)
Remember that we need backpointers.
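The corresponding sketch for the LC-CRF, on the same toy feature set and weights as the forward snippet above; the backpointers recover argmax_Y P(Y | X):

```python
import math

LABELS = ["DT", "NN"]

def psi(y_t, y_prev, x_t, theta):
    feats = {("trans", y_prev, y_t): 1.0, ("emit", y_t, x_t): 1.0}
    return math.exp(sum(theta.get(k, 0.0) * v for k, v in feats.items()))

def viterbi_crf(xs, theta):
    delta = {y: psi(y, "START", xs[0], theta) for y in LABELS}   # delta_j(1)
    backpointers = []
    for x in xs[1:]:
        new_delta, back = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda yp: delta[yp] * psi(y, yp, x, theta))
            back[y] = prev
            new_delta[y] = delta[prev] * psi(y, prev, x, theta)
        delta = new_delta
        backpointers.append(back)
    path = [max(LABELS, key=lambda y: delta[y])]                 # best final state
    for back in reversed(backpointers):                          # follow backpointers
        path.append(back[path[-1]])
    return path[::-1]

theta = {("trans", "START", "DT"): 1.0, ("trans", "DT", "NN"): 1.5,
         ("emit", "DT", "the"): 2.0, ("emit", "NN", "dog"): 2.0}
print(viterbi_crf(["the", "dog"], theta))
```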

Training LC-CRFs
Unlike for HMMs, there is no analytic MLE solution. Use an iterative method to improve the data likelihood:
- Gradient descent
- A version of Newton's method to find where the gradient is 0

Convexity
Fortunately, l(θ) is a concave function (equivalently, its negation is a convex function). That means that we will find the global maximum of l(θ) with gradient descent.

Gradient of Log-Likelihood
Find the gradient of the log likelihood of the training corpus:
l(θ) = log ∏_i P(Y^(i) | X^(i))
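For reference, a standard way to write out the resulting per-feature derivative; this is the quantity the later "Interpretation of Gradient" slides describe (the derivation itself is not spelled out on the slide):

```latex
\frac{\partial \ell(\theta)}{\partial \theta_k}
  = \sum_i \sum_t f_k\!\left(y^{(i)}_t,\, y^{(i)}_{t-1},\, x^{(i)}_t\right)
  \;-\; \sum_i \sum_t \sum_{y,\,y'} f_k\!\left(y,\, y',\, x^{(i)}_t\right)
         P\!\left(y_t = y,\, y_{t-1} = y' \mid X^{(i)}\right)
```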

Regularization
To avoid overfitting, we can encourage the weights to be close to zero. Add a term to the corpus log likelihood:
l(θ) = log ∏_i P(Y^(i) | X^(i)) − Σ_k θ_k² / (2σ²)
σ controls the amount of regularization. This results in an extra term in the gradient: −θ_k / σ²

Gradient Ascent
Walk in the direction of the gradient to maximize l(θ) (a.k.a. gradient descent on a loss function):
θ_new = θ_old + γ ∇l(θ)
γ is a learning rate that specifies how large a step to take. There are more sophisticated ways to do this update:
- Conjugate gradient
- L-BFGS (approximately uses second-derivative information)

Gradient Descent Summary
Initialize θ = (θ_1, θ_2, …, θ_k) randomly
Do for a while:
    Compute ∇l(θ), which will require dynamic programming (i.e., the forward algorithm)
    θ ← θ + γ ∇l(θ)
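A toy sketch of this loop for the brute-force CRF from earlier. The gradient is the empirical-minus-expected feature count described on the next slide, with the expectations computed by enumeration rather than by the forward-backward recursions, so it is only workable for very short sequences; regularization is omitted:

```python
import itertools
import math
from collections import defaultdict

LABELS = ["DT", "NN"]

def seq_feats(ys, xs):
    # Summed feature counts for a whole labelled sequence (toy feature set)
    total, prev = defaultdict(float), "START"
    for y, x in zip(ys, xs):
        total[("trans", prev, y)] += 1.0
        total[("emit", y, x)] += 1.0
        prev = y
    return total

def score(ys, xs, theta):
    return sum(theta.get(k, 0.0) * v for k, v in seq_feats(ys, xs).items())

def gradient(data, theta):
    grad = defaultdict(float)
    for xs, ys in data:
        for k, v in seq_feats(ys, xs).items():          # empirical counts
            grad[k] += v
        cands = list(itertools.product(LABELS, repeat=len(xs)))
        Z = sum(math.exp(score(c, xs, theta)) for c in cands)
        for c in cands:                                  # expected counts under the model
            p = math.exp(score(c, xs, theta)) / Z
            for k, v in seq_feats(c, xs).items():
                grad[k] -= p * v
    return grad

data = [(["the", "dog"], ["DT", "NN"])]
theta = defaultdict(float)
for _ in range(50):                                      # "do for a while"
    for k, v in gradient(data, theta).items():
        theta[k] += 0.5 * v                              # theta <- theta + gamma * gradient
print(dict(theta))
```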

Interpretation of Gradient
The overall gradient is the difference between:
Σ_i Σ_t f_k(y_t^(i), y_{t−1}^(i), x_t^(i)), the empirical distribution of feature f_k in the training corpus,
and:
Σ_i Σ_t Σ_{y,y'} f_k(y, y', x_t^(i)) P(y, y' | X^(i)), the expected distribution of f_k as predicted by the current model.

Interpretation of Gradient
When the corpus likelihood is maximized, the gradient is zero, so the difference between the two terms is zero. Intuitively, this means that finding the MLE by gradient descent is equivalent to telling our model to predict the features so that they follow the same distribution as in the gold standard.

Stochastic Gradient Descent
In the standard version of the algorithm, the gradient is computed over the entire training corpus, so the weights are updated only once per iteration through the training corpus.
Alternative: calculate the gradient over a small mini-batch of the training corpus and update the weights then.
- Many weight updates per iteration through the training corpus
- Usually results in much faster convergence to the final solution, without loss in performance

Stochastic Gradient Descent
Initialize θ = (θ_1, θ_2, …, θ_k) randomly
Do for a while:
    Randomize the order of the samples in the training corpus
    For each mini-batch in the training corpus:
        Compute ∇l(θ) over this mini-batch
        θ ← θ + γ ∇l(θ)
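A sketch of the mini-batch variant of the earlier training loop; it assumes the gradient(batch, theta) helper from that sketch and only changes the looping structure (batch size, learning rate, and epoch count are arbitrary):

```python
import random

def sgd(data, theta, gradient, lr=0.5, batch_size=2, epochs=10):
    # `gradient` is the empirical-minus-expected helper sketched earlier
    for _ in range(epochs):
        random.shuffle(data)                        # randomize sample order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = gradient(batch, theta)              # gradient over this mini-batch only
            for k, v in g.items():
                theta[k] += lr * v                  # update after every mini-batch
    return theta
```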