Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daumé III, University of Maryland, College Park


Discriminative Reranking. Useful for many NLP tasks: it enables us to use arbitrary features, since the search space is limited to an n-best list.

Related Work. POS tagging: Collins 2002, Huang et al. 2007. Dependency parsing: Charniak and Johnson 2005, Shen et al. 2004. Machine translation: Watanabe et al. 2007. These rerankers use joint features defined on the input and output pair, which increases the number of features, so feature engineering becomes essential.

Outline: Introduction; Low-Dimensional Reranking Models (Generative-Style Model, Discriminative-Style Model, Softened-Discriminative Model); Experiments; Conclusion.

Low-Dimensional Reranking. 1) Run independent feature extractors on inputs x and outputs y. 2) Use the training data to find a low-dimensional space: $X = [x_1, \ldots, x_n]$ and $y \in Y = [y_1, \ldots, y_n]$. [Figure: inputs $x_i$ and outputs $y_i$ mapped by projections A and B into a shared low-dimensional space.]

Ranking Models: 1) Generative-Style Model, 2) Discriminative-Style Model, 3) Softened-Discriminative Model.

Ranking Models. 1) Generative-Style Model: uses only reference outputs, i.e. $\{x_i, y_i\}_{i=1\ldots n}$. 2) Discriminative-Style Model: uses reference and candidate outputs $\{x_i, y_i, \hat{y}_{ij}\}_{i=1\ldots n}$ and ensures the highest-scoring output is the reference. 3) Softened-Discriminative Model: the Discriminative-Style Model is difficult to solve, so this is a relaxed version which is tractable.

1) Generative-Style Model. Uses only (input, reference output) pairs $\{x_i, y_i\}_{i=1\ldots n}$. Aim: find A and B; the projection is nothing but $A^T x$ and $B^T y$. [Figure: inputs $x_i$ and reference outputs $y_i$ projected by A and B into the shared space.]

1) Generative-Style Model. Uses only (input, reference output) pairs $\{x_i, y_i\}_{i=1\ldots n}$. Aim: find a and b; the projection is nothing but $a^T X$ and $b^T Y$. $(a, b) = \arg\max \cos(X^T a, Y^T b) = \arg\max a^T X Y^T b$ s.t. $\|X^T a\| = \|Y^T b\| = 1$. Similarity is measured using the cosine of the angle. This can be solved exactly as a generalized eigenvalue problem; use the first k eigenvectors to form A and B.
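Since this objective with unit-norm constraints is a CCA-style problem, one standard way to obtain the top-k directions is to whiten both covariance matrices and take an SVD of the whitened cross-covariance, which is equivalent to the generalized eigenvalue formulation. A minimal numpy sketch, assuming X (d_x × n) and Y (d_y × n) store the feature vectors as columns; the small ridge `reg` is an illustrative choice for numerical stability, not from the paper:

```python
import numpy as np

def generative_style(X, Y, k, reg=1e-6):
    """Return projections A (d_x x k) and B (d_y x k) maximizing a^T X Y^T b
    subject to a^T X X^T a = b^T Y Y^T b = 1, for the top-k directions."""
    Cxx = X @ X.T + reg * np.eye(X.shape[0])   # input covariance (regularized)
    Cyy = Y @ Y.T + reg * np.eye(Y.shape[0])   # output covariance (regularized)
    Cxy = X @ Y.T                              # cross-covariance
    # Whitening transforms: Wx^T Cxx Wx = I, Wy^T Cyy Wy = I.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # SVD of the whitened cross-covariance gives the most correlated directions.
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    A = Wx @ U[:, :k]        # columns a_1 ... a_k
    B = Wy @ Vt.T[:, :k]     # columns b_1 ... b_k
    return A, B
```

A candidate pair is then scored by $\cos(A^T x, B^T \hat{y})$, as on the following slides.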

Discriminative-Style Models: Intuition. Use the reference output ($y_i$) and candidate outputs ($\hat{y}_{ij}$). Aim: find A and B such that the reference output is the highest scoring: $y_i = \arg\max_{\hat{y}_{ij}} \cos(A^T x_i, B^T \hat{y}_{ij})$. Similarity is measured using the cosine of the angle. [Figure: input $x_1$ with reference $y_1$ and candidates $\hat{y}_1, \hat{y}_2, \hat{y}_3$ projected by A and B.]

Discriminative-Style Models. Use the reference output ($y_i$) and candidate outputs ($\hat{y}_{ij}$). Aim: find A and B such that the reference output is the highest scoring: $y_i = \arg\max_{\hat{y}_{ij}} \cos(A^T x_i, B^T \hat{y}_{ij})$. 1) Observe that the score is $a^T x_i\, \hat{y}_{ij}^T b$. 2) Add constraints to penalize incorrect candidates: with margin $m_{ij} = a^T x_i\, y_i^T b - a^T x_i\, \hat{y}_{ij}^T b$, require $m_{ij} \ge 1 - \xi_i / L_{ij}$.

2) Discriminative-Style Model. Enforce the individual constraints:
$\arg\max_{a,b,\xi \ge 0} \; (1-\lambda)\, a^T X Y^T b - \lambda \sum_i \xi_i$
s.t. $a^T X X^T a = 1$, $b^T Y Y^T b = 1$, and $m_{ij} \ge 1 - \xi_i / L_{ij}$.
Harder to solve exactly.

3) Softened-Discriminative Model. Discriminative model: $\arg\max_{a,b,\xi \ge 0} (1-\lambda)\, a^T X Y^T b - \lambda \sum_i \xi_i$ s.t. the length constraints and $m_{ij} \ge 1 - \xi_i / L_{ij}$. Softened-Discriminative model: $\arg\max_{a,b} (1-\lambda)\, a^T X Y^T b + \lambda \sum_{ij} L_{ij} m_{ij}$ s.t. the length constraints. Like the Generative-Style Model, it can be solved exactly. We use it to solve the Discriminative-Style Model.
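The softened objective folds the loss-weighted margins into a single bilinear term, $a^T M b$ with $M = (1-\lambda)\, X Y^T + \lambda \sum_{ij} L_{ij}\, x_i (y_i - \hat{y}_{ij})^T$, so it can be solved with the same whiten-then-SVD machinery as the generative model. A minimal numpy sketch, assuming the "length constraints" are the same unit-norm constraints as before and that `Yhat[i]` holds example i's candidate vectors as columns with losses `L[i][j]`; all names are illustrative, not from the paper's implementation:

```python
import numpy as np

def softened_disc(X, Y, Yhat, L, k, lam=0.1, reg=1e-6):
    """Top-k directions for the softened-discriminative objective."""
    d_x, n = X.shape
    M = (1.0 - lam) * (X @ Y.T)                          # generative term X Y^T
    for i in range(n):
        for j in range(Yhat[i].shape[1]):
            diff = Y[:, i] - Yhat[i][:, j]               # y_i - yhat_ij
            M += lam * L[i][j] * np.outer(X[:, i], diff) # loss-weighted margin term
    # Same unit-norm constraints as the generative model: whiten, then SVD.
    Wx = np.linalg.inv(np.linalg.cholesky(X @ X.T + reg * np.eye(X.shape[0]))).T
    Wy = np.linalg.inv(np.linalg.cholesky(Y @ Y.T + reg * np.eye(Y.shape[0]))).T
    U, s, Vt = np.linalg.svd(Wx.T @ M @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]
```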

Optimizing the Discriminative-Style Model:

    α_ij = L_ij                                      // Initialization
    Repeat
        [A, B] = Softened-Disc(X, Y, α)              // Get the current solution
        m_ij = a^T x_i y_i^T b − a^T x_i ŷ_ij^T b    // Compute margins
        ψ_ij = (1 − m_ij) L_ij                       // Potential slack
        ξ_i = max {0, ψ_ij s.t. ψ_ij > 0}            // Compute slack
        If ξ_i > 0
            d_ij = m_ij − (1 − ξ_i / L_ij)           // Constraint violation
            α_ij = α_ij − γ d_ij                     // Update the Lagrangian variables
        End if
    Until α_ij converges                             // Slack doesn't change
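For concreteness, a minimal sketch of the per-example margin and slack computations used inside this loop, assuming single projection directions a and b (the deck stacks k of them into A and B) and the constraint $m_{ij} \ge 1 - \xi_i / L_{ij}$ as reconstructed above; variable names are illustrative:

```python
import numpy as np

def margins_and_slack(a, b, x_i, y_i, Yhat_i, L_i):
    """Margins m_ij = a^T x_i y_i^T b - a^T x_i yhat_ij^T b and the slack xi_i
    implied by the most violated constraint m_ij >= 1 - xi_i / L_ij."""
    ax = float(a @ x_i)
    ref_score = ax * float(y_i @ b)
    m = np.array([ref_score - ax * float(Yhat_i[:, j] @ b)
                  for j in range(Yhat_i.shape[1])])       # margins m_ij
    psi = (1.0 - m) * np.asarray(L_i, dtype=float)        # potential slacks psi_ij
    xi = max(0.0, float(psi.max())) if psi.size else 0.0  # slack xi_i
    return m, xi
```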

Outline: Introduction; Low-Dimensional Reranking Models (Generative-Style Model, Discriminative-Style Model, Softened-Discriminative Model); Experiments; Conclusion.

Reranking for POS Tagging. An HMM trigram tagger is used to generate the 10-best list, with multi-fold cross validation. Training: use one of our models to find the mappings A and B. Reranking: for every candidate tag sequence, compute the cosine similarity in the sub-space and combine this score with the Viterbi decoding score; the interpolation parameter is tuned.
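A minimal sketch of this reranking step, assuming a simple linear interpolation between the Viterbi score and the sub-space cosine (the exact combination and its weight are tuned on held-out data); `phi_x` and `phi_y` are the feature extractors described on the Features slide below, and all names are illustrative:

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def rerank(sentence, nbest, viterbi_scores, A, B, phi_x, phi_y, alpha=0.5):
    """Pick the candidate tag sequence with the best interpolated score."""
    px = A.T @ phi_x(sentence)                          # project input into the sub-space
    scores = []
    for tags, vit in zip(nbest, viterbi_scores):
        sub = cosine(px, B.T @ phi_y(tags))             # similarity in the sub-space
        scores.append(alpha * vit + (1 - alpha) * sub)  # interpolate the two scores
    return nbest[int(np.argmax(scores))]
```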

Experimental Results. Data sets: the CoNLL shared task on Dependency Parsing, using the fine-grained POS tags. Baseline system: HMM trigram tagger (Thede and Harper, 1999). Baseline rerankers: Collins and Koo (2005), its regularized version, and Huang et al. (2007).

Features. For our models: a sentence is represented as a bag of suffixes of length 2 to 4 (x); for POS tag sequences, we use unigram and bigram POS tags as features (y). For the baseline rerankers: features of the form (suffix, Tag_0) and (suffix, Tag_0, Tag_-1).
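A minimal sketch of these two feature extractors, assuming precomputed index maps (`suffix_vocab`, `tag_vocab`, `bigram_vocab`) built from the training data; with the vocabularies bound (e.g. via functools.partial) they can play the role of `phi_x` and `phi_y` in the reranking sketch above:

```python
from collections import Counter
import numpy as np

def phi_x(words, suffix_vocab):
    """Bag of word suffixes of length 2 to 4."""
    counts = Counter(w[-k:] for w in words for k in (2, 3, 4) if len(w) >= k)
    x = np.zeros(len(suffix_vocab))
    for suf, c in counts.items():
        if suf in suffix_vocab:
            x[suffix_vocab[suf]] = c
    return x

def phi_y(tags, tag_vocab, bigram_vocab):
    """Unigram and bigram POS-tag counts."""
    y = np.zeros(len(tag_vocab) + len(bigram_vocab))
    for t in tags:
        if t in tag_vocab:
            y[tag_vocab[t]] += 1
    for bigram in zip(tags, tags[1:]):
        if bigram in bigram_vocab:
            y[len(tag_vocab) + bigram_vocab[bigram]] += 1
    return y
```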

Results (POS tagging accuracy):

                 English   Chinese   French   Swedish
    Baseline       96.15     92.31    97.41     93.23
    Collins        96.06     92.81    97.35     93.44
    Regularized    96.00     92.88    97.38     93.35
    Oracle         98.39     98.19    99.00     96.48

The oracle improvements are smaller than in parsing.

Results (POS tagging accuracy):

                    English   Chinese   French   Swedish
    Baseline          96.15     92.31    97.41     93.23
    Collins           96.06     92.81    97.35     93.44
    Regularized       96.00     92.88    97.38     93.35
    Generative        96.24     92.95    97.43     93.26
    Softened-Disc     96.32     92.87    97.53     93.36
    Discriminative    96.30     92.91    97.53     93.36
    Oracle            98.39     98.19    99.00     96.48

For each language, one of our models performs better than the baselines.

Results. Performance is not very sensitive to the hyperparameters. Combination with the Viterbi score is crucial:

    Combined with the Viterbi score:
                      English   Chinese   French   Swedish
    Generative          96.24     92.95    97.43     93.26
    Softened-Disc       96.32     92.87    97.53     93.36
    Discriminative      96.30     92.91    97.53     93.36

    Sub-space score alone:
    Generative          94.83     89.89    96.10     91.89
    Softened-Disc       95.04     89.61    95.97     91.95
    Discriminative      94.95     89.76    95.82     92.11

Without the Viterbi score, performance drops by more than 2%.

Conclusions. A family of rerankers based on dimensionality reduction. Related approaches: max-margin CCA [Szedmak et al. 2007], max-margin regression [Szedmak et al. 2006, Wang et al. 2007], supervised semantic hashing [Bai et al. 2010, Grangier et al. 2006]. Our approach learns a linear classifier in the joint feature space, but with constrained weights. It is well suited when it is not clear how to define the joint features. Scalability: online SVD algorithms [Halko et al. 2009], adaptive variants of CCA [Yger et al. 2012].

Thank You!!

[Plot: performance (accuracy) as a function of the regularization parameter tau, for the Discriminative, Softened-Disc, Generative, and Baseline models.]

[Plot: performance (accuracy) as a function of the weight parameter lambda, for the Discriminative, Softened-Disc, Generative, and Baseline models.]