CSE 490U Natural Language Processing, Spring 2016


CSE 490U Natural Language Processing, Spring 2016
Feature-Rich Models
Yejin Choi, University of Washington
[Many slides from Dan Klein, Luke Zettlemoyer]

Structure in the output variable(s)? What is the input representation?
- Generative models (classical probabilistic models): Naïve Bayes (no structure) / HMMs, PCFGs, IBM Models (structured inference)
- Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy / Logistic Regression (no structure) / MEMM, CRF (structured inference)
- Neural network models (representation learning): Feedforward NN, CNN (no structure) / RNN, LSTM, GRU (structured inference)

Feature-Rich Models
Throw anything (features) you want into the stew (the model): log-linear models. This often leads to great performance (sometimes even a best paper award): "11,001 New Features for Statistical Machine Translation", D. Chiang, K. Knight, and W. Wang, NAACL 2009.

Why want richer features?
POS tagging: more information about the context?
- Is the previous word "the"?
- Is the previous word "the" and the next word "of"?
- Is the previous word capitalized and the next word numeric?
- Is there a word "program" within a [-5, +5] window?
- Is the current word part of a known idiom?
- Conjunctions of any of the above?
Desiderata:
- Lots and lots of features like the above: > 200K
- No independence assumption among features
Classical probabilistic models, however:
- Permit only a very small number of features
- Make strong independence assumptions among features

HMMs: P(tag sequence | sentence)
We want a model of sequences y and observations x: states $y_0, y_1, y_2, \ldots, y_n, y_{n+1}$ over words $x_1, x_2, \ldots, x_n$, with
$p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = q(\text{STOP} \mid y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-1})\, e(x_i \mid y_i)$
where $y_0 = \text{START}$, $y_{n+1} = \text{STOP}$, and we call $q(y' \mid y)$ the transition distribution and $e(x \mid y)$ the emission (or observation) distribution.
Assumptions:
- The tag/state sequence is generated by a Markov model.
- Words are chosen independently, conditioned only on the tag/state.
These are totally broken assumptions: why?
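To make the factorization concrete, here is a minimal Python sketch that scores a tagged sentence by summing log transition and log emission terms; the toy probability tables are made up purely for illustration.

```python
from math import log

# Toy HMM parameters: q[(prev_tag, tag)] is a transition probability and
# e[(tag, word)] an emission probability. The numbers are made up.
q = {("START", "D"): 0.5, ("D", "N"): 0.8, ("N", "V"): 0.6, ("V", "STOP"): 0.3}
e = {("D", "the"): 0.6, ("N", "dog"): 0.1, ("V", "barks"): 0.05}

def hmm_log_joint(words, tags):
    """log p(x_1..x_n, y_1..y_n, STOP) = sum of log transitions and log emissions."""
    score, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        score += log(q[(prev, tag)]) + log(e[(tag, word)])
        prev = tag
    return score + log(q[(prev, "STOP")])   # the final q(STOP | y_n) term

print(hmm_log_joint(["the", "dog", "barks"], ["D", "N", "V"]))
```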

PCFG Example. PCFGs: P(parse tree | sentence)
Rules with probabilities $q(\alpha \to \beta)$:
S → NP VP 1.0; VP → Vi 0.4; VP → Vt NP 0.4; VP → VP PP 0.2; NP → DT NN 0.3; NP → NP PP 0.7; PP → IN NP 1.0
Lexicon: Vi → sleeps 1.0; Vt → saw 1.0; NN → man 0.7; NN → woman 0.2; NN → telescope 0.1; DT → the 1.0; IN → with 0.5; IN → in 0.5
The probability of a tree t built with rules $\alpha_1 \to \beta_1, \alpha_2 \to \beta_2, \ldots, \alpha_n \to \beta_n$ is $p(t) = \prod_{i=1}^{n} q(\alpha_i \to \beta_i)$.
Example: for "The man saw the woman with the telescope" (PP attached to the VP),
$p(t, s) = 1.0 \times 0.3 \times 1.0 \times 0.7 \times 0.2 \times 0.4 \times 1.0 \times 0.3 \times 1.0 \times 0.2 \times 1.0 \times 0.5 \times 0.3 \times 1.0 \times 0.1$
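As a sketch of the rule-product definition of p(t), the Python below encodes the grammar above (treating the PP rule's P and the lexicon's IN as the same symbol) and multiplies the probabilities of all rules used in a tree written as nested tuples; the tree encoding is an illustrative choice, not something from the slides.

```python
# Grammar and lexicon from the slide, as q[(lhs, rhs_tuple)] = probability.
q = {
    ("S", ("NP", "VP")): 1.0, ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4,
    ("VP", ("VP", "PP")): 0.2, ("NP", ("DT", "NN")): 0.3, ("NP", ("NP", "PP")): 0.7,
    ("PP", ("IN", "NP")): 1.0, ("Vi", ("sleeps",)): 1.0, ("Vt", ("saw",)): 1.0,
    ("NN", ("man",)): 0.7, ("NN", ("woman",)): 0.2, ("NN", ("telescope",)): 0.1,
    ("DT", ("the",)): 1.0, ("IN", ("with",)): 0.5, ("IN", ("in",)): 0.5,
}

def tree_prob(tree):
    """p(t) = product of q(alpha -> beta) over all rules used in the tree."""
    if isinstance(tree, str):            # a bare word contributes no rule
        return 1.0
    label, children = tree[0], tree[1:]
    beta = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = q[(label, beta)]
    for c in children:
        p *= tree_prob(c)
    return p

# "The man saw the woman with the telescope", PP attached to the VP.
t = ("S", ("NP", ("DT", "the"), ("NN", "man")),
          ("VP", ("VP", ("Vt", "saw"), ("NP", ("DT", "the"), ("NN", "woman"))),
                 ("PP", ("IN", "with"), ("NP", ("DT", "the"), ("NN", "telescope")))))
print(tree_prob(t))
```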

Rich features for long-range dependencies
What's different between the basic PCFG scores here? What (lexical) correlations need to be scored?

LMs: P(text)
Generative process: (1) generate the very first word conditioning on the special symbol START; then (2) pick the next word conditioning on the previous word; repeat (2) until the special word STOP gets picked.
Graphical model: START → x_1 → x_2 → ... → x_{n-1} → STOP
$p(x_1 \ldots x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-1})$, where $\sum_{x_i \in \mathcal{V}'} q(x_i \mid x_{i-1}) = 1$, $x_0 = \text{START}$, and $\mathcal{V}' := \mathcal{V} \cup \{\text{STOP}\}$.
Subtleties:
- If we introduce the special START symbol to the model, then we are assuming that the sentence always starts with the special start word START; thus when we talk about $p(x_1 \ldots x_n)$ it is in fact $p(x_1 \ldots x_n \mid x_0 = \text{START})$.
- While we add the special STOP symbol to the vocabulary $\mathcal{V}$, we do not add the special START symbol to the vocabulary. Why?
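A minimal sketch of the generative story described above: sample the next word from q(· | previous word), starting from START and stopping when STOP is drawn. The toy bigram table is made up for illustration.

```python
import random

# Toy bigram distributions q(. | prev); the numbers are made up.
q = {
    "START": {"the": 0.6, "a": 0.4},
    "the":   {"dog": 0.5, "cat": 0.5},
    "a":     {"dog": 0.5, "cat": 0.5},
    "dog":   {"barks": 0.7, "STOP": 0.3},
    "cat":   {"sleeps": 0.7, "STOP": 0.3},
    "barks": {"STOP": 1.0},
    "sleeps": {"STOP": 1.0},
}

def generate():
    """Condition on START, then keep sampling the next word from
    q(. | previous word) until STOP is drawn."""
    words, prev = [], "START"
    while True:
        nxt = random.choices(list(q[prev]), weights=list(q[prev].values()))[0]
        if nxt == "STOP":
            return words
        words.append(nxt)
        prev = nxt

print(generate())
```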

Internals of probabilistic models: nothing but adding log-probs
- LM: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...

What if we use arbitrary scores instead of log-probs? Change log p(this | that) to Φ(this; that):
- LM: ... + Φ(w7; w5, w6) + Φ(w8; w6, w7) + ...
- PCFG: Φ(NP VP; S) + Φ(Papa; NP) + Φ(VP PP; VP) + ...
- HMM tagging: ... + Φ(t7; t5, t6) + Φ(w7; t7) + ...
- Noisy channel: [Φ(source)] + [Φ(data; source)]
- Naïve Bayes: Φ(Class) + Φ(feature1; Class) + Φ(feature2; Class) + ...

What if we use arbitrary scores instead of log-probs? Change log p(this | that) to Φ(this; that):
- LM: ... + Φ(w7; w5, w6) + Φ(w8; w6, w7) + ...
- PCFG: Φ(NP VP; S) + Φ(Papa; NP) + Φ(VP PP; VP) + ...
- HMM tagging: ... + Φ(t7; t5, t6) + Φ(w7; t7) + ... → MEMM or CRF
- Noisy channel: [Φ(source)] + [Φ(data; source)]
- Naïve Bayes: Φ(Class) + Φ(feature1; Class) + Φ(feature2; Class) + ... → logistic regression / max-ent

Running example: POS tagging
Roadmap of (known / unknown word) accuracies:
- Strawman baseline, most frequent tag: ~90% / ~50%
- Generative models: Trigram HMM: ~95% / ~55%; TnT (HMM++): 96.2% / 86.0% (with smart UNKing)
- Feature-rich models?
- Upper bound: ~98%

Structure in the output variable(s)? What is the input representation?
- Generative models (classical probabilistic models): Naïve Bayes (no structure) / HMMs, PCFGs, IBM Models (structured inference)
- Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy / Logistic Regression (no structure) / MEMM, CRF (structured inference)
- Neural network models (representation learning): Feedforward NN, CNN (no structure) / RNN, LSTM, GRU (structured inference)

Rich features for rich contextual information
Throw in various features about the context:
- Is the previous word "the" and the next word "of"?
- Is the previous word capitalized and the next word numeric?
- Frequency of "the" within a [-15, +15] window?
- Is the current word part of a known idiom?
You can also define features that look at the output y:
- Is the previous word "the" and the next tag IN?
- Is the previous word "the" and the next tag NN?
- Is the previous word "the" and the next tag VB?
You can also take any conjunctions of the above.
f(x, y) = [0, 0, 0, 1, 0, 0, 0, 0, 3, 0.2, 0, 0, ...]
Create a very long feature vector, with dimension often > 200K. Overlapping features are fine: there is no independence assumption among features.
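A sketch of what such a feature function might look like, with overlapping and conjoined indicator features returned sparsely as a dict; the specific feature templates and names are illustrative, not the ones used in any particular tagger.

```python
# A minimal sketch of overlapping, conjoined features for POS tagging,
# represented sparsely as {feature name: value}. Templates are made up.
def features(words, i, prev_tag, tag):
    w = words[i]
    feats = {
        f"tag={tag}": 1.0,
        f"word={w}+tag={tag}": 1.0,
        f"prev_tag={prev_tag}+tag={tag}": 1.0,
        f"prev_word=the+tag={tag}": 1.0 if i > 0 and words[i-1].lower() == "the" else 0.0,
        f"capitalized+next_numeric+tag={tag}":
            1.0 if w[:1].isupper() and i + 1 < len(words) and words[i+1].isdigit() else 0.0,
        f"suffix3={w[-3:]}+tag={tag}": 1.0,
    }
    return {k: v for k, v in feats.items() if v != 0.0}   # keep only active features

print(features("The man saw the woman".split(), 4, "DT", "NN"))
```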

Maximum Entropy (MaxEnt) Models
- Output: y, one POS tag for one word (at a time)
- Input: x (any words in the context), represented as a feature vector f(x, y)
- Model parameters: w
Make a probability using the softmax function:
$p(y \mid x) = \dfrac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
The exponential makes the score positive; the denominator normalizes. Also known as log-linear models (linear if you take the log).
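A minimal sketch of the softmax above: each candidate tag's score is the dot product w · f(x, y), exponentiated to make it positive and normalized over the candidate tags. The feature templates and weights below are made up.

```python
from math import exp

# Made-up feature template and weights, for illustration only.
def f(x, y):
    return {"word=" + x + "+tag=" + y: 1.0, "suffix=" + x[-2:] + "+tag=" + y: 1.0}

w = {"word=running+tag=VBG": 1.2, "suffix=ng+tag=VBG": 0.8, "word=running+tag=NN": 0.3}

def maxent_prob(x, tags):
    """p(y | x) = exp(w . f(x, y)) / sum_y' exp(w . f(x, y'))."""
    scores = {y: sum(w.get(k, 0.0) * v for k, v in f(x, y).items()) for y in tags}
    Z = sum(exp(s) for s in scores.values())
    return {y: exp(s) / Z for y, s in scores.items()}

print(maxent_prob("running", ["NN", "VBG", "JJ"]))
```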

Training MaxEnt Models
Make a probability using the softmax function:
$p(y \mid x) = \dfrac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
Training: maximize the log likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:
$L(w) = \log \prod_i p(y_i \mid x_i) = \sum_i \log \dfrac{\exp(w \cdot f(x_i, y_i))}{\sum_{y'} \exp(w \cdot f(x_i, y'))}$
which also, incidentally, maximizes the entropy (hence "maximum entropy").

Training MaxEnt Models
Make a probability using the softmax function:
$p(y \mid x) = \dfrac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
Training: maximize the log likelihood
$L(w) = \log \prod_i p(y_i \mid x_i) = \sum_i \log \dfrac{\exp(w \cdot f(x_i, y_i))}{\sum_{y'} \exp(w \cdot f(x_i, y'))} = \sum_i \left[ w \cdot f(x_i, y_i) - \log \sum_{y'} \exp(w \cdot f(x_i, y')) \right]$

Training MaxEnt Models
$L(w) = \sum_i \left[ w \cdot f(x_i, y_i) - \log \sum_{y'} \exp(w \cdot f(x_i, y')) \right]$
Take the partial derivative for each $w_k$ in the weight vector $w$:
$\dfrac{\partial L(w)}{\partial w_k} = \sum_i \left[ f_k(x_i, y_i) - \sum_{y'} p(y' \mid x_i)\, f_k(x_i, y') \right]$
The first term is the total count of feature k with respect to the correct predictions; the second is the expected count of feature k with respect to the predicted output.
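A minimal sketch of this gradient as observed feature counts minus expected feature counts under the current model; it reuses f and maxent_prob from the earlier MaxEnt sketch, and the two training pairs are hypothetical.

```python
# Gradient of the conditional log likelihood: observed minus expected counts.
# Assumes f and maxent_prob from the earlier sketch are in scope.
def gradient(data, tags):
    grad = {}
    for x, y_gold in data:
        p = maxent_prob(x, tags)
        for k, v in f(x, y_gold).items():          # + observed count of feature k
            grad[k] = grad.get(k, 0.0) + v
        for y in tags:                             # - expected count under p(y | x)
            for k, v in f(x, y).items():
                grad[k] = grad.get(k, 0.0) - p[y] * v
    return grad

print(gradient([("running", "VBG"), ("dog", "NN")], ["NN", "VBG", "JJ"]))
```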

Convex Optimization for Training
The likelihood function is convex, so we can find the global optimum. Many optimization algorithms and software packages are available: gradient ascent (descent), conjugate gradient, L-BFGS, etc. All we need is the ability to (1) evaluate the function at the current w and (2) evaluate its derivative at the current w.

Graphical Representation of MaxEnt
$p(y \mid x) = \dfrac{\exp(w \cdot f(x, y))}{\sum_{y'} \exp(w \cdot f(x, y'))}$
[Figure: a single output node Y conditioned on the input nodes x_1, x_2, ..., x_n.]

Graphical Representation of Naïve Bayes
$p(x \mid y) = \prod_j p(x_j \mid y)$
[Figure: the output node Y generates each input node x_1, x_2, ..., x_n.]

Naïve Bayes vs. MaxEnt
Naïve Bayes classifier:
- Generative model: p(input | output); for instance, for text categorization, P(words | category)
- Spends unnecessary effort on generating the input
- Independence assumption among input variables: given the category, each word is generated independently of the other words (too strong an assumption in reality!)
- Cannot incorporate arbitrary/redundant/overlapping features
Maximum Entropy classifier:
- Discriminative model: p(output | input); for instance, for text categorization, P(category | words)
- Focuses directly on predicting the output
- By conditioning on the entire input, we don't need to worry about independence assumptions among the input variables
- Can incorporate arbitrary features, including redundant and overlapping ones

Overview: POS tagging accuracies
Roadmap of (known / unknown word) accuracies:
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- TnT (HMM++): 96.2% / 86.0%
- MaxEnt p(s_i | x): 96.8% / 86.8%
- Upper bound: ~98%
Q: what's missing in MaxEnt compared to the HMM?

Structure in the output variable(s)? What is the input representation?
- Generative models (classical probabilistic models): Naïve Bayes (no structure) / HMMs, PCFGs, IBM Models (structured inference)
- Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy / Logistic Regression (no structure) / MEMM, CRF (structured inference)
- Neural network models (representation learning): Feedforward NN, CNN (no structure) / RNN, LSTM, GRU (structured inference)

MEMM Taggers
One step up: also condition on the previous tags:
$p(s_1 \ldots s_m \mid x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_1 \ldots s_{i-1}, x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \ldots x_m)$
Train up $p(s_i \mid s_{i-1}, x_1 \ldots x_m)$ as a discrete log-linear (MaxEnt) model, then use it to score sequences:
$p(s_i \mid s_{i-1}, x_1 \ldots x_m) = \dfrac{\exp(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s_i))}{\sum_{s'} \exp(w \cdot \phi(x_1 \ldots x_m, i, s_{i-1}, s'))}$
This is referred to as an MEMM tagger [Ratnaparkhi 96]. Beam search is effective! (Why?) What's the advantage of beam size 1?
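A minimal beam-search sketch for such a locally normalized tagger. Here local_log_prob stands in for log p(s_i | s_{i-1}, x) from a trained MaxEnt model; it is a toy stub with made-up scores so the example runs. Setting beam_size=1 gives greedy left-to-right tagging.

```python
# Toy stand-in for log p(tag | prev_tag, words, i); the scores are made up.
def local_log_prob(tag, prev_tag, words, i):
    word_pref = {("the", "D"): 0.0, ("dog", "N"): 0.0, ("barks", "V"): 0.0}
    good_transitions = {("START", "D"), ("D", "N"), ("N", "V")}
    return word_pref.get((words[i], tag), -2.0) + \
           (0.0 if (prev_tag, tag) in good_transitions else -1.0)

def beam_decode(words, tags, beam_size=2):
    beam = [(0.0, ["START"])]                      # (log score, partial tag sequence)
    for i in range(len(words)):
        cands = [(score + local_log_prob(t, seq[-1], words, i), seq + [t])
                 for score, seq in beam for t in tags]
        beam = sorted(cands, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0][1][1:]                          # best sequence, drop START

print(beam_decode("the dog barks".split(), ["D", "N", "V"]))
```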

HMM vs. MEMM
HMM: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow (tags generate words)
MEMM: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow (tags conditioned on words)
HMM:
- Generative model: joint probability p(words, tags)
- Generates the input (in addition to the tags), but we need to predict tags, not words!
- Probability of each slice = emission * transition = p(word_i | tag_i) * p(tag_i | tag_i-1)
- Cannot incorporate long-distance features
MEMM:
- Discriminative or conditional model: conditional probability p(tags | words)
- Conditions on the input, focusing only on predicting the tags
- Probability of each slice = p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words)
- Can incorporate long-distance features

The HMM State Lattice / Trellis (repeat slide)
[Figure: a trellis with one column of states {^, N, V, J, D, $} per position of "START Fed raises interest rates STOP"; arcs carry scores such as e(Fed | N), e(raises | V), e(interest | V), e(rates | J), q(V | V), and the transition into STOP.]

The MEMM State Lattice / Trellis
[Figure: the same trellis over x = "START Fed raises interest rates STOP", but each arc is now scored by a conditional term such as p(V | V, x).]

Decoding MaxEnt taggers: just like decoding HMMs
Viterbi, beam search, posterior decoding.
Viterbi algorithm (HMMs): define $\pi(i, s_i)$ to be the max score of a sequence of length i ending in tag $s_i$:
$\pi(i, s_i) = \max_{s_{i-1}} e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$
Viterbi algorithm (MaxEnt / MEMM): since $p(s_1 \ldots s_m \mid x_1 \ldots x_m) = \prod_{i=1}^{m} p(s_i \mid s_{i-1}, x_1 \ldots x_m)$, we can use the same algorithm, just redefining $\pi(i, s_i)$:
$\pi(i, s_i) = \max_{s_{i-1}} p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$
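A minimal Viterbi sketch matching the recursion above, written over generic local log scores: for an HMM the local term would be log e(x_i | s_i) + log q(s_i | s_{i-1}), for an MEMM log p(s_i | s_{i-1}, x). The stub scores are made up so the example runs.

```python
# Toy local score; stands in for the HMM or MEMM term at position i.
def score(words, i, prev, cur):
    emit = {("the", "D"): 0.0, ("dog", "N"): 0.0, ("barks", "V"): 0.0}
    trans = {("START", "D"): 0.0, ("D", "N"): 0.0, ("N", "V"): 0.0}
    return emit.get((words[i], cur), -3.0) + trans.get((prev, cur), -1.0)

def viterbi(words, tags):
    pi = {"START": 0.0}          # pi[s] = best score of a prefix ending in tag s
    backptr = []
    for i in range(len(words)):
        new_pi, bp = {}, {}
        for cur in tags:
            prev = max(pi, key=lambda p: pi[p] + score(words, i, p, cur))
            new_pi[cur] = pi[prev] + score(words, i, prev, cur)
            bp[cur] = prev
        pi, backptr = new_pi, backptr + [bp]
    seq = [max(pi, key=pi.get)]  # best final tag, then follow backpointers
    for bp in reversed(backptr):
        seq.append(bp[seq[-1]])
    return seq[::-1][1:]         # reverse and drop START

print(viterbi("the dog barks".split(), ["D", "N", "V"]))
```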

Overview: Accuracies
Roadmap of (known / unknown word) accuracies:
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- TnT (HMM++): 96.2% / 86.0%
- MaxEnt p(s_i | x): 96.8% / 86.8%
- MEMM tagger: 96.9% / 86.9%
- Upper bound: ~98%

Structure in the output variable(s)? What is the input representation?
- Generative models (classical probabilistic models): Naïve Bayes (no structure) / HMMs, PCFGs, IBM Models (structured inference)
- Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy / Logistic Regression (no structure) / MEMM, CRF (structured inference)
- Neural network models (representation learning): Feedforward NN, CNN (no structure) / RNN, LSTM, GRU (structured inference)

MEMM vs. CRF (Conditional Random Fields)
MEMM: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow
CRF: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow

Graphical Models
Nodes Y1, Y2, Y3 over X1, X2, X3.
- A conditional probability for each node, e.g. p(Y3 | Y2, X3) for Y3, or p(X3) for X3.
- Conditional independence, e.g. p(Y3 | Y1, Y2, X1, X2, X3) = p(Y3 | Y2, X3).
- The joint probability of the entire graph is the product of the conditional probabilities of the nodes.

Undirected Graphical Model Basics
Nodes Y1, Y2, Y3 over X1, X2, X3.
- Conditional independence, e.g. p(Y3 | all other nodes) = p(Y3 | Y3's neighbors).
- No conditional probability for each node; instead, a potential function for each clique, e.g. φ(X1, X2, Y1) or φ(Y1, Y2).
- Typically, log-linear potential functions: φ(Y1, Y2) = exp(Σ_k w_k f_k(Y1, Y2)).

Undirected Graphical Model Basics
Joint probability of the entire graph:
$P(Y) = \frac{1}{Z} \prod_{\text{clique } C} \phi(y_C)$, where $Z = \sum_{Y} \prod_{\text{clique } C} \phi(y_C)$.
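A tiny sketch of this factorization for a three-node chain Y1 - Y2 - Y3 with binary labels: one log-linear "agreement" potential per edge (clique), with Z computed by brute force over all joint assignments. The weight is made up for illustration.

```python
from itertools import product
from math import exp

def phi(a, b):
    """phi(Yi, Yj) = exp(sum_k w_k f_k(Yi, Yj)); here a single made-up
    agreement feature with weight 1.0."""
    return exp(1.0 if a == b else 0.0)

def unnorm(y1, y2, y3):
    return phi(y1, y2) * phi(y2, y3)         # product of clique (edge) potentials

labels = [0, 1]
Z = sum(unnorm(*y) for y in product(labels, repeat=3))   # sum over all joint labelings
print("P(Y = 1,1,1) =", unnorm(1, 1, 1) / Z)
```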

MEMM vs. CRF
MEMM: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow
CRF: NNP VBZ VBN TO VB NR / Secretariat is expected to race tomorrow
MEMM:
- Directed graphical model
- Discriminative or conditional model: conditional probability p(tags | words)
- A probability is defined for each slice: p(tag_i | tag_i-1, word_i) or p(tag_i | tag_i-1, all words)
CRF:
- Undirected graphical model
- Instead of a probability, a potential (energy function) is defined for each slice: φ(tag_i, tag_i-1) * φ(tag_i, word_i), or φ(tag_i, tag_i-1, all words) * φ(tag_i, all words)
- Can incorporate long-distance features

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 01]
Maximum entropy (logistic regression) over whole sequences: sentence x = x_1 ... x_m, tag sequence s = s_1 ... s_m:
$p(s \mid x; w) = \dfrac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))}$
Learning: maximize the (log) conditional likelihood of the training data $\{(x_i, s_i)\}_{i=1}^{n}$:
$\dfrac{\partial}{\partial w_j} L(w) = \sum_{i=1}^{n} \left[ \Phi_j(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \Phi_j(x_i, s) \right]$
Computational challenges? The most likely tag sequence, the normalization constant, and the gradient.

CRFs: Decoding
$s^* = \arg\max_s p(s \mid x; w)$ with $p(s \mid x; w) = \dfrac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))}$
Features must be local: for x = x_1 ... x_m and s = s_1 ... s_m,
$\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$
Then
$\arg\max_s \dfrac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))} = \arg\max_s \exp(w \cdot \Phi(x, s)) = \arg\max_s w \cdot \Phi(x, s)$
Viterbi recursion:
$\pi(i, s_i) = \max_{s_{i-1}} \left[ w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1}) \right]$

CRFs: Computing Normalization*
$p(s \mid x; w) = \dfrac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))}$ with $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$
$\sum_{s'} \exp(w \cdot \Phi(x, s')) = \sum_{s'} \exp\Big( \sum_{j} w \cdot \phi(x, j, s'_{j-1}, s'_j) \Big) = \sum_{s'} \prod_{j} \exp\big(w \cdot \phi(x, j, s'_{j-1}, s'_j)\big)$
Define norm(i, s_i) to be the sum of scores for sequences ending in tag s_i at position i:
$\text{norm}(i, s_i) = \sum_{s_{i-1}} \exp\big(w \cdot \phi(x, i, s_{i-1}, s_i)\big)\, \text{norm}(i-1, s_{i-1})$
This is the forward algorithm! Remember the HMM case:
$\alpha(i, y_i) = \sum_{y_{i-1}} e(x_i \mid y_i)\, q(y_i \mid y_{i-1})\, \alpha(i-1, y_{i-1})$
Could we also use the backward algorithm?
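A minimal sketch of this forward recursion for the CRF normalizer Z(x): local_score stands in for w · φ(x, i, s_{i-1}, s_i) and is a made-up stub so the example runs.

```python
from math import exp, log

# Toy stand-in for w . phi(x, i, prev, cur); the scores are made up.
def local_score(words, i, prev, cur):
    return 0.5 if (prev, cur) in {("START", "D"), ("D", "N"), ("N", "V")} else -0.5

def crf_log_Z(words, tags):
    """Forward recursion: norm(i, s_i) = sum_{s_{i-1}} exp(score) * norm(i-1, s_{i-1})."""
    norm = {"START": 1.0}                          # norm(0, START) = 1
    for i in range(len(words)):
        norm = {cur: sum(exp(local_score(words, i, prev, cur)) * norm[prev]
                         for prev in norm)
                for cur in tags}
    return log(sum(norm.values()))                 # log Z(x) = log sum over final tags

print(crf_log_Z("the dog barks".split(), ["D", "N", "V"]))
```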

CRFs: Computing Gradient*
$p(s \mid x; w) = \dfrac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))}$ with $\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$
$\dfrac{\partial}{\partial w_k} L(w) = \sum_{i=1}^{n} \left[ \Phi_k(x_i, s_i) - \sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) \right]$
The expected-count term decomposes over positions:
$\sum_{s} p(s \mid x_i; w)\, \Phi_k(x_i, s) = \sum_{s} p(s \mid x_i; w) \sum_{j=1}^{m} \phi_k(x_i, j, s_{j-1}, s_j) = \sum_{j=1}^{m} \sum_{a, b} p(s_{j-1} = a, s_j = b \mid x_i; w)\, \phi_k(x_i, j, a, b)$
We need forward and backward messages to compute these pairwise marginals. See the notes for full details!

Overview: Accuracies
Roadmap of (known / unknown word) accuracies:
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- TnT (HMM++): 96.2% / 86.0%
- MaxEnt p(s_i | x): 96.8% / 86.8%
- MEMM tagger: 96.9% / 86.9%
- CRF (untuned): 95.7% / 76.2%
- Upper bound: ~98%

Cyclic Network [Toutanova et al. 03]
Train two MEMMs and multiply them together to score. And be very careful: tune regularization, try lots of different features. See the paper for full details.
[Figure: (a) Left-to-Right CMM, (b) Right-to-Left CMM, (c) Bidirectional Dependency Network]

Overview: Accuracies
Roadmap of (known / unknown word) accuracies:
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- TnT (HMM++): 96.2% / 86.0%
- MaxEnt p(s_i | x): 96.8% / 86.8%
- MEMM tagger: 96.9% / 86.9%
- Perceptron: 96.7% / ??
- CRF (untuned): 95.7% / 76.2%
- Cyclic tagger: 97.2% / 89.0%
- Upper bound: ~98%

Locally normalized models (HMMs, MEMMs)
- Local scores are probabilities.
- However, one issue in local models: label bias and other explaining-away effects. An MEMM tagger's local scores can be near one without having both good transitions and emissions, which means evidence often doesn't flow properly. Why isn't this a big deal for POS tagging?
Globally normalized models
- Local scores are arbitrary scores.
- Conditional Random Fields (CRFs): slower to train (structured inference at each iteration of learning).
- Neural networks: global training without structured inference.

Structure in the output variable(s)? What is the input representation?
- Generative models (classical probabilistic models): Naïve Bayes (no structure) / HMMs, PCFGs, IBM Models (structured inference)
- Log-linear models (discriminatively trained feature-rich models): Perceptron, Maximum Entropy / Logistic Regression (no structure) / MEMM, CRF (structured inference)
- Neural network models (representation learning): Feedforward NN, CNN (no structure) / RNN, LSTM, GRU (structured inference)