A Support Vector Method for Multivariate Performance Measures

A Support Vector Method for Multivariate Performance Measures
Thorsten Joachims, Cornell University, Department of Computer Science
Thanks to Rich Caruana, Alexandru Niculescu-Mizil, Pierre Dupont, Jérôme Callut

Supervised Learning
Find a function from input space X to output space Y such that the prediction error is low.
Text Classification: F1-Score, Precision/Recall Break-Even Point (PRBEP)
Medical Diagnosis: ROC Area
Information Retrieval: Precision at 10

Related Work
Approach: Estimate Probabilities. E.g. [Platt, 2000] [Langford & Zadrozny, 2005] [Niculescu-Mizil & Caruana, 2005]. Potentially solves a harder problem than required.
Approach: Optimize a Substitute Loss, then Post-Process. E.g. [Lewis, 2001] [Yang, 2001] [Abe et al., 2004] [Caruana & Niculescu-Mizil, 2004]. Typically a multi-step approach with cross-validation.
Approach: Directly Optimize the Desired Loss.
Linear cost models: e.g. [Morik et al., 1999] [Lin et al., 2002]
ROC-Area: e.g. [Herbrich et al., 2000] [Rakotomamonjy, 2004] [Cortes & Mohri, 2003] [Freund et al., 1998] [Yan et al., 2003] [Ferri et al., 2002]
F1-Score: difficult [Musicant et al., 2003]

Overview
Formulation of a Support Vector Machine for:
any loss function that can be computed from the contingency table (F1-score, Error Rate, Linear Cost Models, etc.)
any loss function that can be computed from contingency tables with cardinality constraints (PRBEP, Prec@k, Rec@k, etc.)
ROC-Area
Polynomial-time algorithm
Conventional classification SVM is a special case
New optimization problem
New representation and (extremely sparse) support vectors

Optimizing F1-Score
F1-score: harmonic average of precision and recall.
The F1-score is a non-linear function of the example set: for example vector x1, should we predict y1 = 1 if P(y1 = 1 | x1) = 0.4? It depends on the other examples!
[Figure: a column of predicted probabilities p for the example set and two plots of the resulting F1-score as a function of the classification threshold.]
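To make this concrete, here is a minimal sketch (not part of the original slides; names and conventions are illustrative) of how the F1-score is computed from the contingency table. Because precision and recall are set-level quantities, the score cannot be written as a sum of per-example losses, which is exactly why the best prediction for one example depends on the others.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * Prec * Rec / (Prec + Rec), computed from the contingency table.
    Labels are +1 / -1 as in the slides."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    if tp == 0:
        return 0.0          # convention: F1 = 0 when there are no true positives
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```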

Approach: Multivariate Prediction
Training Data: (x1, y1), ..., (xn, yn)
Conventional Setting: learn a classification rule h: X → Y that is applied to one example at a time.
Multivariate Setting: learn a single rule that maps the whole tuple of inputs (x1, ..., xn) to a tuple of labels (y1, ..., yn).
Note: If the loss function decomposes linearly over the examples, both settings are equivalent.

Support Vector Machine [Vapnik et al.]
Training Examples: (x1, y1), ..., (xn, yn) with yi in {-1, +1}
Hypothesis Space: linear rules h(x) = sign(w·x + b)
Training: find the hyperplane with minimal ||w||, i.e. with maximal margin.
Hard Margin (separable case): minimize 1/2 ||w||^2 subject to yi(w·xi + b) ≥ 1 for all i.
Soft Margin (with training error): minimize 1/2 ||w||^2 + C Σ ξi subject to yi(w·xi + b) ≥ 1 - ξi, ξi ≥ 0 for all i.
[Figure: separating hyperplane with margin d between the two classes.]

Multivariate Support Vector Machine
Approach: Linear Discriminant [Collins 2002] [Lafferty et al. 2002] [Taskar et al. 2004] [Tsochantaridis et al. 2004] etc.
Learn a weight vector w so that the score w·Ψ(x, y) is maximal for the correct tuple of labels y.
[Figure: a tuple of example vectors x with their scores w·xi and the corresponding sign predictions in {-1, +1}.]

Multivariate SVM Optimization Problem
Approach: Structural SVM [Taskar et al. 04] [Tsochantaridis et al. 04]
Hard-margin optimization problem: find the weight vector w of minimal norm such that the correct labeling scores higher than every incorrect labeling y' by a margin of at least Δ(y', y).
Soft-margin optimization problem: additionally allow a single slack variable ξ that is shared by all of these constraints.
Theorem: At the solution, the training loss is upper bounded by ξ.
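The optimization problems themselves did not survive the transcription. As a reference, here is a reconstruction of the soft-margin problem in the notation of the accompanying paper, where x̄ and ȳ denote the tuples of all n training inputs and labels, Ψ the joint feature map, and Δ the multivariate loss:

```latex
\begin{aligned}
\min_{w,\;\xi \ge 0}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \;+\; C\,\xi \\
\text{s.t.}\quad & \forall\, \bar{y}' \in \bar{\mathcal{Y}} \setminus \{\bar{y}\}:\;
  w^{\top}\!\bigl[\Psi(\bar{x},\bar{y}) - \Psi(\bar{x},\bar{y}')\bigr]
  \;\ge\; \Delta(\bar{y}',\bar{y}) - \xi
\end{aligned}
```

The hard-margin version drops ξ. Because the labeling predicted by the model maximizes w·Ψ(x̄, ·), its own constraint immediately gives Δ(prediction, ȳ) ≤ ξ, which is the upper bound stated in the theorem on the slide.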

Multivariate SVM Generalizes Classification SVM
Theorem: The solutions of the multivariate SVM with the number of errors as the loss function and of an (unbiased) classification SVM are equal.
Multivariate SVM optimizing Error Rate vs. Classification SVM (unbiased): the two optimization problems yield the same weight vector w.

Cutting Plane Algorithm for Multivariate SVM
Approach: Sparse Approximation Structural SVM [Tsochantaridis et al. 04]
Input: training data, C, tolerance ε
REPEAT
    compute the most violated constraint (the argmax labeling)
    IF it is violated by more than ε
        add the constraint to the working set
        optimize the SVM objective over the working set
    ENDIF
UNTIL the working set has not changed during the iteration
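A sketch of this loop in Python (not from the talk; the helpers loss, psi, argmax_violated, and solve_qp are hypothetical placeholders for the multivariate loss Δ, the joint feature map Ψ, the loss-specific argmax, and a QP solver over the current working set):

```python
import numpy as np

def cutting_plane_train(x_bar, y_bar, C, eps, loss, psi, argmax_violated, solve_qp):
    """Sparse-approximation (cutting-plane) training loop for the single-slack SVM.

    loss(y_hat, y)            -- multivariate loss Delta(y_hat, y)
    psi(x_bar, y_bar)         -- joint feature map, returns a numpy vector
    argmax_violated(w, x, y)  -- labeling maximizing loss + score (most violated)
    solve_qp(W, x, y, C, ...) -- re-optimizes (w, xi) over the working set W
    """
    w = np.zeros_like(psi(x_bar, y_bar))
    xi = 0.0
    working_set = []
    while True:
        y_hat = argmax_violated(w, x_bar, y_bar)            # most violated constraint
        margin = float(w @ (psi(x_bar, y_bar) - psi(x_bar, y_hat)))
        if loss(y_hat, y_bar) - margin > xi + eps:          # violated by more than eps?
            working_set.append(y_hat)                       # add constraint to working set
            w, xi = solve_qp(working_set, x_bar, y_bar, C, loss, psi)
        else:
            break                                           # working set unchanged: done
    return w, xi
```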

Polynomial Convergence Bound
Theorem [Tsochantaridis et al., 2004]: The sparse-approximation algorithm finds a solution to the soft-margin optimization problem after adding at most a number of constraints to the working set that is polynomial in C, 1/ε, the bound on the loss, and the bound on the norm of the feature vectors, so that the Kuhn-Tucker conditions are fulfilled up to a precision ε. The loss has to be bounded, and the norm of Ψ(x, y) has to be bounded.

ARGMAX for Contingency Table
Problem: find the most violated constraint, i.e. the labeling y' maximizing Δ(y', y) + w·Ψ(x, y').
Key Insight: Only O(n²) different contingency tables exist, and the argmax for each table is easy to compute via sorting. Time O(n²).
Applies to: Error rate, F1, Prec@k, Rec@k, PRBEP, etc.
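A minimal sketch of this search in Python (not the SVMstruct implementation; function and argument names are made up). It enumerates all (true positive, false positive) counts; after one sort per class, the best labeling for a given table assigns +1 to the highest-scoring examples. The loss is supplied as delta(tp, fp, fn, tn):

```python
import numpy as np

def most_violated_contingency(w, X_pos, X_neg, delta):
    """Argmax over labelings of delta(table) + w . Psi, with Psi(x, y') = sum_i y'_i x_i.
    O(n_pos * n_neg) candidate tables after one sort of each class."""
    s_pos, s_neg = X_pos @ w, X_neg @ w                    # scores per example
    order_pos, order_neg = np.argsort(-s_pos), np.argsort(-s_neg)
    cum_pos = np.concatenate(([0.0], np.cumsum(s_pos[order_pos])))
    cum_neg = np.concatenate(([0.0], np.cumsum(s_neg[order_neg])))
    n_pos, n_neg = len(s_pos), len(s_neg)
    tot_pos, tot_neg = cum_pos[-1], cum_neg[-1]
    best_val, best_ab = -np.inf, (n_pos, 0)
    for a in range(n_pos + 1):                             # a = true positives
        for b in range(n_neg + 1):                         # b = false positives
            # w . Psi: +score for examples labeled +1, -score for examples labeled -1
            score = (2 * cum_pos[a] - tot_pos) + (2 * cum_neg[b] - tot_neg)
            val = delta(a, b, n_pos - a, n_neg - b) + score
            if val > best_val:
                best_val, best_ab = val, (a, b)
    a, b = best_ab
    y_pos, y_neg = -np.ones(n_pos, int), -np.ones(n_neg, int)
    y_pos[order_pos[:a]] = 1                               # top-a positives labeled +1
    y_neg[order_neg[:b]] = 1                               # top-b negatives labeled +1
    return y_pos, y_neg
```

For the F1-loss one would pass, e.g., delta = lambda tp, fp, fn, tn: 1.0 - 2 * tp / (2 * tp + fp + fn) if tp else 1.0.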

ARGMAX for ROC-Area
Problem: find the most violated constraint for the ROC-Area loss.
Key Insight: The ROC-Area loss is proportional to the number of swapped (positive, negative) pairs. Representing the n² pairs explicitly, the loss decomposes linearly over the pairs, and the argmax can be found via a sort in time O(n log n).
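As an illustration of the key insight (a sketch only, not the argmax routine used during training), the swapped-pair count that defines the ROC-Area loss can be computed in O(n log n) by sorting one class and binary-searching the other:

```python
import numpy as np

def swapped_pairs(w, X_pos, X_neg):
    """Number of (positive, negative) pairs that w ranks in the wrong order.
    ROC-Area = 1 - swapped_pairs / (n_pos * n_neg).  Ties count as half a swap here."""
    s_pos = np.sort(X_pos @ w)
    s_neg = X_neg @ w
    below = np.searchsorted(s_pos, s_neg, side="left")   # positives strictly below each negative
    tied = np.searchsorted(s_pos, s_neg, side="right") - below
    return float(np.sum(below) + 0.5 * np.sum(tied))
```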

Experiment: Generalization Performance
Experiment Setup:
Macro-average over all classes in the dataset.
Baseline: classification SVM with a linear cost model.
Select C and cost ratio j via a holdout test (2/3 train, 1/3 validation).
Significance: two-tailed Wilcoxon test (** = 95%, * = 90%).

Experiment: Unbalanced Classes in Reuters

Experiment: Number of Iterations
Numbers averaged over all binary classification tasks within each dataset and over all parameter settings.

Experiment: Number of Support Vectors
Number of support vectors, averaged over all binary classification tasks within each dataset and over all parameter settings.
Corollary: For error rate as the loss function, the hard-margin solution after the first iteration is equal to the Rocchio algorithm.
Caution: different convergence criteria.

Implementation in SVMstruct
Multivariate SVM implemented in SVMstruct: http://svmlight.joachims.org
Also implementations for CFG, Sequence Alignment, OMM.
Application specific: loss function, representation, algorithms to compute the argmax.
Generic structure that covers OMM, MPD, Finite-State Transducers, MRF, etc. (polynomial-time inference).

Conclusions
Generalization of the SVM to multivariate loss functions; classification SVMs are a special case.
Polynomial-time training algorithms for any loss function based on the contingency table, and for ROC-Area.
New representation of the SVM optimization problem: support vectors represent vectors of classifications and can be extremely sparse.
Future work: other performance measures, other methods (e.g. boosting), faster training algorithms exploiting special structure.

Joint Feature Map
Feature vector Ψ(x, y) that describes the match between x and y.
Learn a single weight vector w and rank outputs y by w·Ψ(x, y).
Problems: How to predict efficiently? How to learn efficiently? Manageable number of parameters?
[Figure: the sentence x = "The dog chased the cat" together with several candidate parse trees y1, y2, ..., y58.]

Joint Feature Map for Trees
Weighted Context Free Grammar: each rule (e.g. S → NP VP) has a weight; the score of a tree is the sum of its rule weights; the highest-scoring tree (the prediction f: X → Y) is found with a CKY parser.
Problems: How to predict efficiently? How to learn efficiently? Manageable number of parameters?
[Figure: for x = "The dog chased the cat" and a parse tree y, Ψ(x, y) counts how often each rule occurs in y, e.g. S → NP VP, NP → Det N, VP → V NP, Det → the, N → dog, V → chased, N → cat.]
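A small sketch of this joint feature map in Python (illustrative only; the nested-tuple tree encoding is an assumption, not the representation used in the actual parser):

```python
from collections import Counter

def rule_counts(tree):
    """Psi(x, y) for a weighted CFG: how often each rule occurs in the parse tree y.
    A tree is (label, child_1, ..., child_k); leaf children are word strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts = Counter({(label, rhs): 1})                 # e.g. ('S', ('NP', 'VP'))
    for child in children:
        if not isinstance(child, str):
            counts += rule_counts(child)
    return counts

def tree_score(weights, tree):
    """Score of a tree = w . Psi(x, y) = sum of the weights of the rules it uses."""
    return sum(weights.get(rule, 0.0) * n for rule, n in rule_counts(tree).items())

# The slide's example sentence with one candidate parse:
parse = ('S',
         ('NP', ('Det', 'The'), ('N', 'dog')),
         ('VP', ('V', 'chased'), ('NP', ('Det', 'the'), ('N', 'cat'))))
```

rule_counts(parse) then contains counts for S → NP VP, NP → Det N (twice), VP → V NP, and the lexical rules, matching the Ψ(x, y) vector sketched on the slide.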

Experiment: Natural Language Parsing
Implementation: Implemented the sparse-approximation algorithm in SVMlight; incorporated a modified version of Mark Johnson's CKY parser; learned a weighted CFG.
Data: Penn Treebank sentences of length at most 10 (starting with POS tags). Train on Sections 2-22: 4098 sentences. Test on Section 23: 163 sentences.

More Expressive Features Linear composition: So far: General: Example:

Experiment: Part-of-Speech Tagging
Task: Given a sequence of words x (e.g. "The dog chased the cat"), predict the sequence of tags y (Det N V Det N). Dependencies come from the tag-tag transitions of the Markov model.
Model: Markov model with one state per tag and words as emissions. Each word is described by a ~250,000-dimensional feature vector (all word suffixes/prefixes, word length, capitalization, ...).
Experiment (by Dan Fleisher): train/test on 7966/1700 sentences from the Penn Treebank.
[Bar chart of test accuracy (%): Brill (RBT) 95.78, HMM (ACOPOST) 95.63, kNN (MBT), Tree Tagger, and SVM Multiclass (SVM-light) between 94.68 and 95.75, SVM-HMM (SVM-struct) 96.49.]
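A rough sketch of the kind of sparse emission features the slide describes (feature names and cut-offs are made up; the slide only characterizes the feature set as suffixes/prefixes, word length, and capitalization):

```python
def word_features(word, max_affix=3):
    """Sparse emission features for one word; in the experiment these are combined
    with the tag-tag transition features of the Markov model."""
    feats = {f"len={min(len(word), 10)}": 1.0}
    if word[:1].isupper():
        feats["capitalized"] = 1.0
    for k in range(1, max_affix + 1):
        if k <= len(word):
            feats[f"prefix={word[:k]}"] = 1.0
            feats[f"suffix={word[-k:]}"] = 1.0
    return feats    # indexed into a ~250,000-dimensional feature vector
```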

Overview
Task: Discriminative learning with complex outputs
Related Work
SVM algorithm for complex outputs: formulation as a convex quadratic program, general algorithm, sparsity bound
Example 1: Learning to parse natural language (learning a weighted context free grammar)
Example 2: Optimizing F1-score in text classification (learning a linear rule that directly optimizes the F1-score)
Example 3: Learning to cluster (learning a clustering function that produces the desired clusterings)
Conclusions

Learning to Cluster: Noun-Phrase Co-reference
Given a set of noun phrases x, predict a clustering y.
Structural dependencies, since the prediction has to be an equivalence relation.
Correlation dependencies from interactions.
[Figure: x is the text "The policeman fed the cat. He did not know that he was late. The cat is called Peter." and y is the clustering of its noun phrases.]

Struct SVM for Supervised Clustering (by Thomas Finley)
Representation: y is an n × n matrix of pairwise co-cluster indicators that is reflexive (y_ii = 1), symmetric (y_ij = y_ji), and transitive (if y_ij = 1 and y_jk = 1, then y_ik = 1).
Joint feature map and loss function are defined over these pairwise indicators.
Prediction: NP-hard, use a linear relaxation instead [Demaine & Immorlica, 2003].
Find most violated constraint: NP-hard, use a linear relaxation instead [Demaine & Immorlica, 2003].
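A sketch of this pairwise representation in Python (the helper pair_features is a hypothetical stand-in for the application-specific description of a noun-phrase pair; this is not code from SVMstruct):

```python
import numpy as np
from itertools import combinations

def is_valid_clustering(y):
    """y is an n x n 0/1 matrix; it encodes a clustering iff it is an equivalence
    relation: reflexive, symmetric, and transitive."""
    y = np.asarray(y, dtype=bool)
    reflexive = bool(np.all(np.diag(y)))
    symmetric = np.array_equal(y, y.T)
    reach = (y.astype(int) @ y.astype(int)) > 0      # i,k connected through some j
    transitive = not np.any(reach & ~y)
    return reflexive and symmetric and bool(transitive)

def joint_feature_map(y, pair_features):
    """Psi(x, y): sum of pair feature vectors over all co-clustered pairs i < j."""
    n = len(y)
    psi = np.zeros_like(pair_features(0, 1), dtype=float)
    for i, j in combinations(range(n), 2):
        if y[i][j]:
            psi += pair_features(i, j)
    return psi
```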

Conclusions
Learning to predict complex outputs: a clean and concise framework for many applications; discriminative methods are feasible and offer advantages.
An SVM method for learning with complex outputs.
Case Studies: learning to predict trees (natural language parsing), optimizing non-standard performance measures (imbalanced classes), learning to cluster (noun-phrase coreference resolution).
Software: SVMstruct, http://svmlight.joachims.org/
Open questions: Which applications have complex outputs? Is it possible to extend other algorithms to complex outputs? More efficient training algorithms for special cases?

Examples of Complex Output Spaces
Non-Standard Performance Measures (e.g. F1-score, Lift)
F1-score: harmonic average of precision and recall.
For a new example vector x8: predict y8 = 1 if P(y8 = 1 | x8) = 0.4? Depends on the other examples!

Struct SVM for Optimizing F1-Score
Loss Function: Δ(y', y) = 1 - F1-score of the prediction y'.
Representation / Joint feature map: Ψ(x, y) = Σi yi xi over the example vectors.
Prediction: sign(w·xi) for each example.
Find most violated constraint: only O(n²) different contingency tables exist, so search them by brute force.

Experiment: Text Classification
Dataset: Reuters-21578 (ModApte split); 9603 training / 3299 test examples; 90 categories; TFIDF unit vectors (no stemming, no stopword removal).
Experiment Setup: classification SVM with the optimal C in hindsight (C=8); F1-loss SVM with C=0.0625 (selected via 2-fold cross-validation).
Results