CSE 447/547 Natural Language Processing Winter 2018

CSE 447/547 Natural Language Processing Winter 2018 Feature Rich Models (Log Linear Models) Yejin Choi University of Washington [Many slides from Dan Klein, Luke Zettlemoyer]

Announcements: HW #3 due Feb 16 (Fri)? Feb 19 (Mon)? Feb 5: guest lecture by Max Forbes! VerbPhysics (using a factor graph model). Related models: Conditional Random Fields, Markov Random Fields, log-linear models. Related algorithms: belief propagation, sum-product algorithm, forward-backward.

Goals of this Class: How to construct a feature vector f(x); how to extend the feature vector to f(x,y); how to construct a probability model using any given f(x,y); how to learn the parameter vector w for MaxEnt (log-linear) models; knowing the key differences between MaxEnt and Naïve Bayes; how to extend MaxEnt to sequence tagging.

Structure in the output variable(s)? What is the input representation?
Generative models (classical probabilistic models): Naïve Bayes (no structure); HMMs, PCFGs, IBM Models (structured inference)
Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy, Logistic Regression (no structure); MEMM, CRF (structured inference)
Neural network models (representation learning): Feedforward NN, CNN (no structure); RNN, LSTM, GRU (structured inference)

Feature Rich Models: throw anything (features) you want into the stew (the model). Log-linear models often lead to great performance (sometimes even a best paper award): "11,001 New Features for Statistical Machine Translation", D. Chiang, K. Knight, and W. Wang, NAACL, 2009.

Why want richer features? POS tagging: more information about the context? Is the previous word "the"? Is the previous word "the" and the next word "of"? Is the previous word capitalized and the next word numeric? Is there a word "program" within a [-5,+5] window? Is the current word part of a known idiom? Conjunctions of any of the above? Desiderata: lots and lots of features like the above (> 200K), with no independence assumption among features. Classical probability models, however, permit only a very small number of features and make strong independence assumptions among features.

HMMs: P(tag sequence | sentence). We want a model of sequences y and observations x (graph: y_0 → y_1 → y_2 → ... → y_n → y_{n+1}, with each y_i emitting x_i):
$p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$
where $y_0 = \text{START}$, and we call $q(y' \mid y)$ the transition distribution and $e(x \mid y)$ the emission (or observation) distribution. Assumptions: the tag/state sequence is generated by a Markov model; words are chosen independently, conditioned only on the tag/state. These are totally broken assumptions: why?
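To make the factorization concrete, here is a minimal sketch (not from the slides) of scoring a tagged sentence under this HMM; q and e are hypothetical dictionaries of estimated transition and emission probabilities.

```python
def hmm_joint_prob(words, tags, q, e):
    """p(x_1..x_n, y_1..y_{n+1}) = q(STOP | y_n) * prod_i q(y_i | y_{i-1}) e(x_i | y_i),
    with y_0 = START. q and e are hypothetical dicts: q[(tag, prev_tag)], e[(word, tag)]."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= q[(tag, prev)] * e[(word, tag)]
        prev = tag
    return prob * q[("STOP", prev)]   # final transition into STOP
```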

PCFGs: P(parse tree | sentence). PCFG example rules with probabilities: S → NP VP 1.0; VP → Vi 0.4; VP → Vt NP 0.4; VP → VP PP 0.2; NP → DT NN 0.3; NP → NP PP 0.7; PP → P NP 1.0. Lexical rules: Vi → sleeps 1.0; Vt → saw 1.0; NN → man 0.7; NN → woman 0.2; NN → telescope 0.1; DT → the 1.0; IN → with 0.5; IN → in 0.5. The probability of a tree $t$ with rules $\alpha_1 \to \beta_1, \alpha_2 \to \beta_2, \ldots, \alpha_n \to \beta_n$ is $p(t) = \prod_{i=1}^{n} q(\alpha_i \to \beta_i)$, where $q(\alpha \to \beta)$ is the probability for rule $\alpha \to \beta$. (Figure: a parse tree for "The man saw the woman with the telescope"; $p(t \mid s)$ is the product of the probabilities of the rules used in that tree.)

Rich features for long range dependencies: What's different between the basic PCFG scores here? What (lexical) correlations need to be scored?

LMs: P(text). Generative process: (1) generate the very first word conditioning on the special symbol START; then (2) pick the next word conditioning on the previous word; repeat (2) until the special word STOP gets picked. Graphical model: START → x_1 → x_2 → ... → x_{n-1} → STOP.
$p(x_1 \ldots x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-1})$ where $\sum_{x_i \in \mathcal{V}'} q(x_i \mid x_{i-1}) = 1$, $x_0 = \text{START}$, and $\mathcal{V}' := \mathcal{V} \cup \{\text{STOP}\}$.
Subtleties: if we are introducing the special START symbol to the model, then we are making the assumption that the sentence always starts with the special start word START; thus when we talk about $p(x_1 \ldots x_n)$ it is in fact $p(x_1 \ldots x_n \mid x_0 = \text{START})$. While we add the special STOP symbol to the vocabulary, we do not add the special START symbol to the vocabulary. Why?
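As a quick illustration (not from the slides), this bigram score can be computed in a single pass over the sentence; q is again a hypothetical dictionary of estimated bigram probabilities.

```python
def bigram_lm_prob(tokens, q):
    """p(x_1..x_n) = prod_{i=1}^{n} q(x_i | x_{i-1}) with x_0 = START; the generative
    process above ends when STOP is picked. q is a hypothetical dict keyed (word, prev_word)."""
    prob = 1.0
    prev = "START"
    for word in tokens:          # e.g. tokens = ["the", "dog", "barks", "STOP"]
        prob *= q[(word, prev)]
        prev = word
    return prob
```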

Internals of probabilistic models: nothing but adding log-probs. LM: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ... PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ... HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ... Noisy channel: [log p(source)] + [log p(data | source)]. Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...

Arbitrary scores instead of log probs? Change log p(this | that) to Φ(this ; that). LM: ... + Φ(w7 ; w5, w6) + Φ(w8 ; w6, w7) + ... PCFG: Φ(NP VP ; S) + Φ(Papa ; NP) + Φ(VP PP ; VP) + ... HMM tagging: ... + Φ(t7 ; t5, t6) + Φ(w7 ; t7) + ... Noisy channel: [Φ(source)] + [Φ(data ; source)]. Naïve Bayes: Φ(Class) + Φ(feature1 ; Class) + Φ(feature2 ; Class)

Arbitrary scores instead of log probs? Change log p(this | that) to Φ(this ; that). LM: ... + Φ(w7 ; w5, w6) + Φ(w8 ; w6, w7) + ... PCFG: Φ(NP VP ; S) + Φ(Papa ; NP) + Φ(VP PP ; VP) + ... HMM tagging: ... + Φ(t7 ; t5, t6) + Φ(w7 ; t7) + ... (this becomes an MEMM or CRF). Noisy channel: [Φ(source)] + [Φ(data ; source)]. Naïve Bayes: Φ(Class) + Φ(feature1 ; Class) + Φ(feature2 ; Class) (this becomes logistic regression / max-ent).

Running example: POS tagging. Roadmap of (known / unknown) accuracies: Strawman baseline, most freq tag: ~90% / ~50%. Generative models: Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0% (with smart "UNK"-ing). Feature-rich models? Upper bound: ~98%

Structure in the output variable(s)? What is the input representation?
Generative models (classical probabilistic models): Naïve Bayes (no structure); HMMs, PCFGs, IBM Models (structured inference)
Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy, Logistic Regression (no structure); MEMM, CRF (structured inference)
Neural network models (representation learning): Feedforward NN, CNN (no structure); RNN, LSTM, GRU (structured inference)

Rich features for rich contextual information. Throw in various features about the context: f1 := Is the previous word "the" and the next word "of"? f2 := Is the previous word capitalized and the next word numeric? f3 := Frequency of "the" within a [-15,+15] window? f4 := Is the current word part of a known idiom? Given the sentence "the blah the truth of the blah", let's say x = "truth" above; then f(x) := (f1, f2, f3, f4), so f(truth) = (true, false, 3, false) => f(x) = (1, 0, 3, 0)

Rich features for rich contextual information. Throw in various features about the context: f1 := Is the previous word "the" and the next word "of"? f2 := ... You can also define features that look at the output y! f1_n := Is the previous word "the" and the next tag N? f2_n := ... f1_v := Is the previous word "the" and the next tag V? ... (replicate all features with respect to different values of y). f(x) := (f1, f2, f3, f4); f(x,y) := (f1_n, f2_n, f3_n, f4_n, f1_v, f2_v, f3_v, f4_v, f1_d, f2_d, f3_d, f4_d, ...)

Rich features for rich contextual information. You can also define features that look at the output y! f1_n := Is the previous word "the" and the next tag N? f2_n := ... f1_v := Is the previous word "the" and the next tag V? ... (replicate all features with respect to different values of y). Given the sentence "the blah the truth of the blah", let's say x = "truth" above and y = N; then f(truth) = (true, false, 3, false), f(x,y) := (f1_n, f2_n, f3_n, f4_n, f1_v, f2_v, f3_v, f4_v, f1_d, f2_d, f3_d, f4_d, ...), and f(truth, N) = ?

Rich features for rich contextual information. Throw in various features about the context: f1 := Is the previous word "the" and the next word "of"? f2 := Is the previous word capitalized and the next word numeric? f3 := Frequency of "the" within a [-15,+15] window? f4 := Is the current word part of a known idiom? You can also define features that look at the output y! f1_n := Is the previous word "the" and the next tag N? f1_v := Is the previous word "the" and the next tag V? You can also take any conjunctions of the above. f(x, y) = [0, 0, 0, 1, 0, 0, 0, 0, 3, 0.2, 0, 0, ...] Create a very long feature vector with dimensionality often > 200K. Overlapping features are fine: there is no independence assumption among features.
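To make the feature-vector construction concrete, here is a minimal sketch (not from the slides); the feature names, the sparse-dict representation, and the idiom stub are all illustrative assumptions.

```python
def context_features(tokens, i):
    """Binary/count features of position i in the sentence (f(x))."""
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<STOP>"
    window = tokens[max(0, i - 15): i + 16]
    return {
        "prev=the&next=of": int(prev_word == "the" and next_word == "of"),
        "prev_cap&next_num": int(prev_word[:1].isupper() and next_word.isdigit()),
        "count_the_window": window.count("the"),
        "in_known_idiom": 0,   # would need an idiom lexicon; stubbed out here
    }

def joint_features(tokens, i, tag):
    """Replicate every context feature for the candidate tag y (f(x, y))."""
    return {f"{name}&tag={tag}": value
            for name, value in context_features(tokens, i).items()}

# Example from the slides: x = "truth" in "the blah the truth of the blah"
tokens = "the blah the truth of the blah".split()
print(context_features(tokens, 3))       # {'prev=the&next=of': 1, ..., 'count_the_window': 3, ...}
print(joint_features(tokens, 3, "N"))    # same features, replicated for tag N
```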

Goals of this Class: How to construct a feature vector f(x); how to extend the feature vector to f(x,y); how to construct a probability model using any given f(x,y); how to learn the parameter vector w for MaxEnt (log-linear) models; knowing the key differences between MaxEnt and Naïve Bayes; how to extend MaxEnt to sequence tagging.

Maximum Entropy (MaxEnt) Models. Output: y, one POS tag for one word (at a time). Input: x (any words in the context), represented as a feature vector f(x, y). Model parameters: w. Make a probability using the SoftMax function:
$p(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
(the exponential makes scores positive; the denominator normalizes). Also known as log-linear models (linear if you take the log).
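A minimal sketch (not from the slides) of this SoftMax computation; `features(x, y)` is a hypothetical helper returning a sparse feature dict like the one sketched earlier.

```python
import math

def score(w, f_xy):
    """Linear score w · f(x, y) for a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in f_xy.items())

def maxent_probs(w, x, tags, features):
    """p(y | x) = exp(w·f(x,y)) / sum_y' exp(w·f(x,y'))  (softmax over candidate tags)."""
    scores = {y: score(w, features(x, y)) for y in tags}
    m = max(scores.values())                       # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}
```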

Training MaxEnt Models. Make a probability using the SoftMax function: $p(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$. Training: maximize the log likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:
$L(w) = \log \prod_i p(y_i \mid x_i) = \sum_i \log \frac{\exp(w \cdot f(x_i, y_i))}{\sum_{y'} \exp(w \cdot f(x_i, y'))}$
which also incidentally maximizes the entropy (hence "maximum entropy").

Training MaxEnt Models. Make a probability using the SoftMax function: $p(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$. Training: maximize the log likelihood
$L(w) = \log \prod_i p(y_i \mid x_i) = \sum_i \log \frac{\exp(w \cdot f(x_i, y_i))}{\sum_{y'} \exp(w \cdot f(x_i, y'))} = \sum_i \Big[ w \cdot f(x_i, y_i) - \log \sum_{y'} \exp(w \cdot f(x_i, y')) \Big]$

Training MaxEnt Models.
$L(w) = \sum_i \Big[ w \cdot f(x_i, y_i) - \log \sum_{y'} \exp(w \cdot f(x_i, y')) \Big]$
Take the partial derivative for each $w_k$ in the weight vector $w$:
$\frac{\partial L(w)}{\partial w_k} = \sum_i \Big[ f_k(x_i, y_i) - \sum_{y'} p(y' \mid x_i)\, f_k(x_i, y') \Big]$
The first term is the total count of feature k with respect to the correct predictions; the second is the expected count of feature k with respect to the predicted output.
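Continuing the sketch, the gradient is just observed feature counts minus expected feature counts; `maxent_probs` and `features` are the hypothetical helpers from the previous snippets.

```python
def gradient(w, data, tags, features):
    """dL/dw_k = sum_i f_k(x_i, y_i) - sum_i sum_y' p(y'|x_i) f_k(x_i, y')."""
    grad = {}
    for x, y in data:
        for name, value in features(x, y).items():            # empirical (observed) counts
            grad[name] = grad.get(name, 0.0) + value
        probs = maxent_probs(w, x, tags, features)             # from the earlier sketch
        for y_prime in tags:                                   # expected counts under the model
            for name, value in features(x, y_prime).items():
                grad[name] = grad.get(name, 0.0) - probs[y_prime] * value
    return grad
```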

Convex Optimization for Training. The log likelihood is concave (equivalently, the negative log likelihood is convex), so we can reach the global optimum. Many optimization algorithms and software packages are available: gradient ascent (descent), conjugate gradient, L-BFGS, etc. All we need are: (1) evaluate the function at the current w, and (2) evaluate its derivative at the current w.
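For illustration only, a plain batch gradient-ascent loop over the sketched gradient; in practice one would typically hand the function and gradient to an off-the-shelf optimizer such as L-BFGS and add regularization.

```python
def train(data, tags, features, lr=0.1, iters=50):
    """Plain batch gradient ascent on the log likelihood (a stand-in for L-BFGS etc.)."""
    w = {}
    for _ in range(iters):
        g = gradient(w, data, tags, features)   # gradient() from the earlier sketch
        for name, value in g.items():
            w[name] = w.get(name, 0.0) + lr * value
    return w
```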

Goals of this Class: How to construct a feature vector f(x); how to extend the feature vector to f(x,y); how to construct a probability model using any given f(x,y); how to learn the parameter vector w for MaxEnt (log-linear) models; knowing the key differences between MaxEnt and Naïve Bayes; how to extend MaxEnt to sequence tagging.

Graphical Representation of MaxEnt: $p(y \mid x) = \frac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$ (Figure: a single output node Y with the input nodes x_1, x_2, ..., x_n feeding into it.)

Graphical Representation of Naïve Bayes: $p(x \mid y) = \prod_j p(x_j \mid y)$ (Figure: the output node Y generating each input node x_1, x_2, ..., x_n.)

Naïve Bayes (Y generating x_1, x_2, ..., x_n) vs. MaxEnt (x_1, x_2, ..., x_n feeding into Y). Naïve Bayes Classifier: a generative model, p(input | output); for text categorization, P(words | category); spends unnecessary effort on generating the input; makes an independence assumption among input variables: given the category, each word is generated independently of the other words (too strong an assumption in reality!); cannot incorporate arbitrary/redundant/overlapping features. Maximum Entropy Classifier: a discriminative model, p(output | input); for text categorization, P(category | words); focuses directly on predicting the output; by conditioning on the entire input, we don't need to worry about independence assumptions among input variables; can incorporate arbitrary features, including redundant and overlapping ones.

Overview: POS tagging accuracies. Roadmap of (known / unknown) accuracies: Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0%; MaxEnt P(s_i | x): 96.8% / 86.8%. Q: what's missing in MaxEnt compared to the HMM? Upper bound: ~98%

Structure in the output variable(s)? What is the input representation?
Generative models (classical probabilistic models): Naïve Bayes (no structure); HMMs, PCFGs, IBM Models (structured inference)
Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy, Logistic Regression (no structure); MEMM, CRF (structured inference)
Neural network models (representation learning): Feedforward NN, CNN (no structure); RNN, LSTM, GRU (structured inference)

Goals of this Class: How to construct a feature vector f(x); how to extend the feature vector to f(x,y); how to construct a probability model using any given f(x,y); how to learn the parameter vector w for MaxEnt (log-linear) models; knowing the key differences between MaxEnt and Naïve Bayes; how to extend MaxEnt to sequence tagging.

MEMM Taggers. One step up: also condition on previous tags.
$p(s_1 \ldots s_m \mid x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_1 \ldots s_{i-1}, x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \ldots x_m)$
Train up $p(s_i \mid s_{i-1}, x_1 \ldots x_m)$ as a discrete log-linear (maxent) model, then use it to score sequences:
$p(s_i \mid s_{i-1}, x_1 \ldots x_m) = \frac{\exp\big(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s_i)\big)}{\sum_{s'} \exp\big(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s')\big)}$
This is referred to as an MEMM tagger [Ratnaparkhi 96].
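A minimal sketch (not from the slides) of scoring a whole tag sequence under this factorization; `local_prob` stands in for the trained per-position MaxEnt model.

```python
def memm_sequence_prob(tokens, tag_seq, local_prob):
    """p(s_1..s_m | x_1..x_m) = prod_i p(s_i | s_{i-1}, x_1..x_m).
    local_prob(tokens, i, s_prev, s) is a hypothetical trained MaxEnt model
    (e.g. the softmax sketch above applied to features phi(x, i, s_prev, s))."""
    prob = 1.0
    prev = "<START>"
    for i, tag in enumerate(tag_seq):
        prob *= local_prob(tokens, i, prev, tag)
        prev = tag
    return prob
```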

HMM vs. MEMM (example: Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR). HMM: a generative model of the joint probability p(words, tags); it generates the input (in addition to the tags), but we need to predict tags, not words! Probability of each slice = emission * transition = p(word_i | tag_i) * p(tag_i | tag_i-1); cannot incorporate long-distance features. MEMM: a discriminative or conditional model of the conditional probability p(tags | words); it conditions on the input, focusing only on predicting tags. Probability of each slice = p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words); can incorporate long-distance features.

The HMM State Lattice / Trellis (repeat slide). (Figure: a trellis with a column of states {^, N, V, J, D, $} at each position of "START Fed raises interest rates STOP"; a path through the lattice is scored by emissions such as e(Fed | N), e(raises | V), e(interest | V), e(rates | J) and transitions such as q(V | V), ending with the transition into STOP.)

The MEMM State Lattice / Trellis. (Figure: the same trellis over x = "START Fed raises interest rates STOP", but each arc is now scored with a conditional probability such as p(V | V, x).)

Decoding maxent taggers: just like decoding HMMs, with $p(s_1 \ldots s_m \mid x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \ldots x_m)$. Options: Viterbi, beam search, posterior decoding. Viterbi algorithm (HMMs): define $\pi(i, s_i)$ to be the max score of a sequence of length $i$ ending in tag $s_i$:
$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$
Viterbi algorithm (MaxEnt): we can use the same algorithm for MEMMs, we just need to redefine $\pi(i, s_i)$:
$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$
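A sketch of the Viterbi recursion for an MEMM under the same assumptions (a hypothetical `local_prob(tokens, i, s_prev, s)` giving p(s_i | s_{i-1}, x)); a real implementation would typically work in log space.

```python
def viterbi_memm(tokens, tags, local_prob):
    """pi(i, s) = max_{s'} local_prob(s | s', x, i) * pi(i-1, s'), with back-pointers."""
    n = len(tokens)
    pi = [{} for _ in range(n)]
    bp = [{} for _ in range(n)]
    for s in tags:
        pi[0][s] = local_prob(tokens, 0, "<START>", s)
        bp[0][s] = "<START>"
    for i in range(1, n):
        for s in tags:
            best_prev, best = None, float("-inf")
            for s_prev in tags:
                v = pi[i - 1][s_prev] * local_prob(tokens, i, s_prev, s)
                if v > best:
                    best_prev, best = s_prev, v
            pi[i][s], bp[i][s] = best, best_prev
    # follow back-pointers from the best final tag
    last = max(pi[n - 1], key=pi[n - 1].get)
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(bp[i][seq[-1]])
    return list(reversed(seq))
```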

Overview: Accuracies. Roadmap of (known / unknown) accuracies: Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0%; MaxEnt P(s_i | x): 96.8% / 86.8%; MEMM tagger: 96.9% / 86.9%. Upper bound: ~98%

Structure in the output variable(s)? What is the input representation?
Generative models (classical probabilistic models): Naïve Bayes (no structure); HMMs, PCFGs, IBM Models (structured inference)
Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy, Logistic Regression (no structure); MEMM, CRF (structured inference)
Neural network models (representation learning): Feedforward NN, CNN (no structure); RNN, LSTM, GRU (structured inference)

MEMM vs. CRF (Conditional Random Fields). (Figure: the example sentence "Secretariat is expected to race tomorrow" with tags NNP VBZ VBN TO VB NR, drawn once as an MEMM and once as a CRF.)

MEMM vs. CRF (example: Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR). MEMM: a directed graphical model; CRF: an undirected graphical model. Both are discriminative or conditional models of the conditional probability p(tags | words). In the MEMM, a probability is defined for each slice: p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words). In the CRF, instead of a probability, a potential (energy function) is defined for each slice: f(tag_i, tag_i-1) * f(tag_i, word_i) or f(tag_i, tag_i-1, all words) * f(tag_i, all words). Both can incorporate long-distance features.

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 01]: maximum entropy (logistic regression) over whole sequences. Sentence: $x = x_1 \ldots x_m$; tag sequence: $s = s_1 \ldots s_m$.
$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$
Learning: maximize the (log) conditional likelihood of the training data $\{(x_i, s_i)\}_{i=1}^{n}$:
$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{n} \Big( \Phi_j(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \Phi_j(x_i, s) \Big)$
Computational challenges?! Most likely tag sequence, normalization constant, gradient.

CRFs: Decoding $s^* = \arg\max_s p(s \mid x; w)$. Features must be local: for $x = x_1 \ldots x_m$ and $s = s_1 \ldots s_m$,
$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$, with $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$.
$\arg\max_s \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)} = \arg\max_s \exp\big(w \cdot \Phi(x, s)\big) = \arg\max_s w \cdot \Phi(x, s)$
Viterbi recursion: $\pi(i, s_i) = \max_{s_{i-1}} \big[ w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1}) \big]$

CRFs: Computing Normalization*
$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$, with $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$.
$\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big) = \sum_{s'} \exp\Big( \sum_{j} w \cdot \phi(x, j, s_{j-1}, s_j) \Big) = \sum_{s'} \prod_{j} \exp\big(w \cdot \phi(x, j, s_{j-1}, s_j)\big)$
Define norm(i, s_i) to be the sum of scores for sequences ending in tag $s_i$ at position $i$:
$\text{norm}(i, s_i) = \sum_{s_{i-1}} \exp\big(w \cdot \phi(x, i, s_{i-1}, s_i)\big)\, \text{norm}(i-1, s_{i-1})$
This is the Forward Algorithm! Remember the HMM case: $\alpha(i, y_i) = \sum_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \alpha(i-1, y_{i-1})$. Could we also use backward?
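A sketch of the forward algorithm for the CRF normalizer, working in log space for stability; `score(x, i, s_prev, s)` is a hypothetical function returning w · φ(x, i, s_{i-1}, s_i), so the slide's norm(i, s_i) corresponds to exp(alpha[i][s_i]) here.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed stably."""
    vals = list(values)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def crf_log_partition(tokens, tags, score):
    """Forward algorithm for log Z(x) = log sum_s exp(w · Phi(x, s))."""
    n = len(tokens)
    alpha = [{} for _ in range(n)]
    for s in tags:
        alpha[0][s] = score(tokens, 0, "<START>", s)
    for i in range(1, n):
        for s in tags:
            alpha[i][s] = logsumexp(
                alpha[i - 1][s_prev] + score(tokens, i, s_prev, s) for s_prev in tags)
    return logsumexp(alpha[n - 1][s] for s in tags)
```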

CRFs: Computing Gradient*
$p(s \mid x; w) = \frac{\exp\big(w \cdot \Phi(x, s)\big)}{\sum_{s'} \exp\big(w \cdot \Phi(x, s')\big)}$, with $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$.
$\frac{\partial}{\partial w_k} L(w) = \sum_{i=1}^{n} \Big( \Phi_k(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) \Big)$
For the second (expected-count) term:
$\sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) = \sum_{s} p(s \mid x_i; w) \sum_{j=1}^{m} \phi_k(x_i, j, s_{j-1}, s_j) = \sum_{j=1}^{m} \sum_{a,b} \Big( \sum_{s:\, s_{j-1}=a,\, s_j=b} p(s \mid x_i; w) \Big)\, \phi_k(x_i, j, a, b)$
We need forward and backward messages to compute these pairwise marginals. See the notes for full details!

Overview: Accuracies. Roadmap of (known / unknown) accuracies: Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0%; MaxEnt P(s_i | x): 96.8% / 86.8%; MEMM tagger: 96.9% / 86.9%; CRF (untuned): 95.7% / 76.2%. Upper bound: ~98%

Cyclic Network [Toutanova et al 03]: train two MEMMs and multiply them together to score. And be very careful: tune regularization, try lots of different features; see the paper for full details. (Figure: (a) Left-to-Right CMM, (b) Right-to-Left CMM, (c) Bidirectional Dependency Network.)

Overview: Accuracies. Roadmap of (known / unknown) accuracies: Most freq tag: ~90% / ~50%; Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0%; MaxEnt P(s_i | x): 96.8% / 86.8%; MEMM tagger: 96.9% / 86.9%; Perceptron: 96.7% / ??; CRF (untuned): 95.7% / 76.2%; Cyclic tagger: 97.2% / 89.0%. Upper bound: ~98%

Locally normalized models (HMMs, MEMMs): local scores are probabilities. However, there is one issue in local models: label bias and other explaining-away effects. MEMM taggers' local scores can be near one without having both good transitions and emissions; this means that evidence often doesn't flow properly. Why isn't this a big deal for POS tagging? Globally normalized models: local scores are arbitrary scores. Conditional Random Fields (CRFs): slower to train (structured inference at each iteration of learning). Neural networks: global training w/o structured inference.

Structure in the output variable(s)? What is the input representation?
Generative models (classical probabilistic models): Naïve Bayes (no structure); HMMs, PCFGs, IBM Models (structured inference)
Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy, Logistic Regression (no structure); MEMM, CRF (structured inference)
Neural network models (representation learning): Feedforward NN, CNN (no structure); RNN, LSTM, GRU (structured inference)

Supplementary Material

Graphical Models. (Figure: nodes Y1, Y2, Y3 and X1, X2, X3.) A conditional probability for each node, e.g. p(Y3 | Y2, X3) for Y3, or p(X3) for X3. Conditional independence, e.g. p(Y3 | Y2, X3) = p(Y3 | Y1, Y2, X1, X2, X3). The joint probability of the entire graph is the product of the conditional probabilities of the nodes.

Undirected Graphical Model Basics. (Figure: nodes Y1, Y2, Y3 and X1, X2, X3.) Conditional independence, e.g. p(Y3 | all other nodes) = p(Y3 | Y3's neighbors). There is no conditional probability for each node; instead, a potential function is defined for each clique, e.g. φ(X1, X2, Y1) or φ(Y1, Y2). Typically, log-linear potential functions are used: $\phi(Y_1, Y_2) = \exp\big( \sum_k w_k f_k(Y_1, Y_2) \big)$.

Undirected Graphical Model Basics. (Figure: nodes Y1, Y2, Y3 and X1, X2, X3.) The joint probability of the entire graph:
$P(Y) = \frac{1}{Z} \prod_{\text{clique } C} \phi(y_C), \qquad Z = \sum_{y} \prod_{\text{clique } C} \phi(y_C)$
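A toy illustration (not from the slides) of the product-of-potentials definition on a three-node chain with made-up log-linear potentials; the structure and weight are purely for illustration.

```python
import itertools
import math

# Toy undirected chain Y1 - Y2 - Y3 with binary variables:
# P(Y) = (1/Z) * phi(y1, y2) * phi(y2, y3).

def phi(a, b, w=1.5):
    """Log-linear agreement potential: phi(a, b) = exp(w * 1[a == b])."""
    return math.exp(w if a == b else 0.0)

def joint(y1, y2, y3):
    unnorm = lambda a, b, c: phi(a, b) * phi(b, c)   # product over cliques {Y1,Y2}, {Y2,Y3}
    Z = sum(unnorm(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
    return unnorm(y1, y2, y3) / Z

# Sanity check: the joint sums to 1 over all 8 assignments.
print(sum(joint(*y) for y in itertools.product([0, 1], repeat=3)))   # 1.0
```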