Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM
Seth Neel, University of Pennsylvania (sethneel@wharton.upenn.edu)
September 28, 2017

Overview
Structural SVMs have shown promise for building accurate structured prediction models in natural language processing and other domains.
Current training algorithms, while polynomial time, are expensive on large datasets (superlinear in the number of examples).
The paper proposes a new cutting plane training method that runs in linear time: the number of iterations until convergence is independent of the number of training examples.
Empirical work shows that it is much faster than conventional cutting plane methods.

SVM
Binary classification (primal):
$$\min_{w,\;\epsilon_i \ge 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_i \epsilon_i \qquad \text{s.t. } \forall i \in \{1,\dots,n\}:\; y_i\,(w^T x_i) \ge 1 - \epsilon_i$$

Hinge Loss
Hinge loss is a convex upper bound for the 0-1 loss. The SVM can be represented as unconstrained minimization of the regularized hinge loss.
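As a small aside, here is a minimal NumPy sketch (mine, not from the slides) of the regularized hinge-loss objective and a subgradient step on it; the data and the constant C below are made up for illustration.

```python
import numpy as np

def hinge_objective(w, X, y, C):
    """Regularized hinge loss: 0.5*||w||^2 + (C/n) * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)
    return 0.5 * w @ w + (C / len(y)) * np.maximum(0.0, 1.0 - margins).sum()

def hinge_subgradient(w, X, y, C):
    """One subgradient of the objective above (the hinge is non-smooth at margin 1)."""
    margins = y * (X @ w)
    active = margins < 1.0                      # examples that violate the margin
    return w - (C / len(y)) * (y[active, None] * X[active]).sum(axis=0)

# Toy usage: plain subgradient descent on the unconstrained objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = np.zeros(3)
for _ in range(300):
    w -= 0.05 * hinge_subgradient(w, X, y, C=10.0)
print(round(hinge_objective(w, X, y, C=10.0), 3))
```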

Multiclass SVM
Recall: the multiclass margin is defined as the score difference between the highest-scoring label and the second highest.
$$\min_{w,\;\epsilon} \;\; \frac{1}{2}\sum_k \|w_k\|^2 + C\sum_i \epsilon_i \qquad \text{s.t. } \forall i, (x_i, y_i) \in D,\ \forall k \ne y_i:\; w_{y_i}^T x_i - w_k^T x_i \ge 1 - \epsilon_i,\quad \epsilon_i \ge 0$$

Structured Prediction
Learn a function $f : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a space of multivariate and structured outputs.
For a given $x$, predict $f_w(x) = \operatorname{argmax}_{y \in \mathcal{Y}} w^T \Psi(x, y)$.
Intuitively, $\Psi$ measures the compatibility of $y$ and $x$.
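To make the argmax prediction concrete, here is a tiny sketch (mine, not from the slides) for a toy multiclass case where $\Psi(x, y)$ stacks $x$ into the block indexed by the label $y$; the names joint_feature and predict, the weights, and the data are made up.

```python
import numpy as np

def joint_feature(x, y, num_classes):
    """Toy joint feature map: place x in the block indexed by label y.
    With this Psi, argmax_y w^T Psi(x, y) reduces to ordinary multiclass scoring."""
    psi = np.zeros(num_classes * len(x))
    psi[y * len(x):(y + 1) * len(x)] = x
    return psi

def predict(w, x, num_classes):
    """f_w(x) = argmax_y w^T Psi(x, y), by enumerating the (small) label set."""
    scores = [w @ joint_feature(x, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))

w = np.array([1.0, 0.0, 0.0, 2.0])                       # made-up weights: 2 classes, 2 features
print(predict(w, np.array([0.5, 1.0]), num_classes=2))   # -> 1
```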

Structured SVM
Assume that $f_w(x, y) = w^T \Psi(x, y)$. We generalize the SVM optimization problem to train $w$.
The true loss function is $\Delta(y, h_w(x)) = \mathbf{1}\{y \ne h_w(x)\}$; hinge loss provides a convex upper bound to the true loss function:
margin-rescaling: $\Delta_{MR}(h_w(x), y) = \max_{y' \in \mathcal{Y}} \{\Delta(y, y') - w^T \Psi(x, y) + w^T \Psi(x, y')\}$
slack-rescaling: $\Delta_{SR}(h_w(x), y) = \max_{y' \in \mathcal{Y}} \{\Delta(y, y')\,(1 - w^T \Psi(x, y) + w^T \Psi(x, y'))\}$
The two are conceptually similar; we focus on margin-rescaling.
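Since the margin-rescaling hinge is itself a maximization over labels, for small label sets it can be computed by brute-force enumeration. The sketch below (mine, not the paper's code; margin_rescaled_hinge, labels, and delta are made-up names) doubles as the "most violated constraint" search that the cutting plane algorithms below rely on.

```python
def margin_rescaled_hinge(w, x, y, labels, joint_feature, delta):
    """max_{y'} [ delta(y, y') - w^T Psi(x, y) + w^T Psi(x, y') ], by enumeration.
    Returns the hinge value and the maximizing label y'."""
    score_true = w @ joint_feature(x, y)
    best_val, best_y = float("-inf"), None
    for y_prime in labels:
        val = delta(y, y_prime) - score_true + w @ joint_feature(x, y_prime)
        if val > best_val:
            best_val, best_y = val, y_prime
    return best_val, best_y

# Example call, reusing the toy multiclass feature map from the previous sketch:
# margin_rescaled_hinge(w, x, y, labels=range(2),
#                       joint_feature=lambda x, y: joint_feature(x, y, num_classes=2),
#                       delta=lambda a, b: float(a != b))
```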

n-slack structural SVM optimization (MR)
$$\min_{w,\;\epsilon \ge 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_i \epsilon_i \qquad \text{s.t. } \forall y' \in \mathcal{Y},\ \forall i \in [n]:\; w^T[\Psi(x_i, y_i) - \Psi(x_i, y')] \ge \Delta(y_i, y') - \epsilon_i$$
Note that if $\Psi$ is the $|\mathcal{Y}| \cdot p$-dimensional embedding we recover the multiclass formulation, where $p$ is the dimension of $x_i$.

Past Approaches
The n-slack optimization problem has $n\,|\mathcal{Y}|$ constraints, so it is not obviously efficiently solvable.
In fact it is: a greedily constructed cutting plane model requires only $O(n/\epsilon^2)$ constraints.
Other methods: exponentiated gradient methods, Taskar's reformulation as a QP, stochastic gradient methods.

Cutting Plane Algorithm 1
[The n-slack cutting plane algorithm (Algorithm 1) was shown as a pseudocode figure on this slide.]

Cutting Plane Algorithm 1
This algorithm is efficient assuming the existence of a separation oracle that computes the most violated constraint [line 5 of the algorithm].
Note that for natural choices of $\Psi$ and $\Delta$, this is the assumption that we can efficiently solve the (loss-augmented) inference problem.
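Since the pseudocode itself is not reproduced here, the following is only a rough sketch (mine, not the paper's exact Algorithm 1) of how such an n-slack working-set loop is typically organized; separation_oracle and solve_qp are assumed helper routines, with the regularization constant handled inside solve_qp.

```python
def cutting_plane_n_slack(examples, separation_oracle, solve_qp, eps, max_iter=100):
    """Generic n-slack cutting plane loop (sketch). W[i] is the working set of
    constraints found so far for example i. separation_oracle(w, x, y) is assumed
    to return (y_hat, violation) with
    violation = Delta(y, y_hat) - w^T [Psi(x, y) - Psi(x, y_hat)];
    solve_qp(W) is assumed to return (w, slacks) for the current working sets."""
    W = [set() for _ in examples]
    w, slacks = solve_qp(W)
    for _ in range(max_iter):
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat, violation = separation_oracle(w, x, y)
            if violation > slacks[i] + eps:       # constraint violated by more than eps
                W[i].add(y_hat)
                added = True
        if not added:
            return w                              # all constraints satisfied up to eps
        w, slacks = solve_qp(W)                   # re-solve over the enlarged working sets
    return w
```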

Contribution (this paper): Faster Training!
Using a simple reformulation of the optimization problem:
1. constant iteration complexity, which implies linear runtime in n
2. several orders of magnitude improvement in runtime (worst-case analysis)
3. an empirical study shows the speedup holds in practice

1-slack formulation
Theorem. The n-slack formulation is equivalent to the optimization
$$\min_{w,\;E \ge 0} \;\; \frac{1}{2}\|w\|^2 + C E \qquad \text{s.t. } \forall (y'_1, \dots, y'_n) \in \mathcal{Y}^n:\; \frac{1}{n}\, w^T \sum_{i=1}^n [\Psi(x_i, y_i) - \Psi(x_i, y'_i)] \ge \frac{1}{n}\sum_i \Delta(y_i, y'_i) - E$$
Note that we now have only one slack variable shared among all constraints, but exponentially many more constraints ($|\mathcal{Y}|^n$ of them).

What is the point?
The new reformulation has constraints that bind on linear combinations of data points, which implies that support vectors are linear combinations of data points.
This flexibility gives a much sparser set of non-zero dual variables, which translates into a smaller (constant-size) cutting plane model.
High level: exponentially more constraints, but only a constant number of them matter.

1-slack: proof
It suffices to show that for any fixed $w$, the smallest feasible $E$ in the 1-slack formulation and the smallest feasible $\frac{1}{n}\sum_i \epsilon_i$ in the n-slack formulation are equal.
Proof.
In n-slack, for fixed $w$ and each $i$, the constraints give $\epsilon_i \ge \Delta(y_i, y') - w^T[\Psi(x_i, y_i) - \Psi(x_i, y')]$ for every $y'$, so the smallest feasible value is
$$\epsilon_i = \max_{y'} \big\{\Delta(y_i, y') - w^T[\Psi(x_i, y_i) - \Psi(x_i, y')]\big\}.$$
In 1-slack, the smallest feasible $E$ is similarly
$$E = \max_{y'_1, \dots, y'_n} \frac{1}{n}\sum_i \big(\Delta(y_i, y'_i) - w^T[\Psi(x_i, y_i) - \Psi(x_i, y'_i)]\big).$$

1-slack: proof ctd
Proof (continued).
But this function separates over each $y'_i$, so it is equal to
$$\frac{1}{n}\sum_i \max_{y'_i} \big(\Delta(y_i, y'_i) - w^T[\Psi(x_i, y_i) - \Psi(x_i, y'_i)]\big),$$
which, by the n-slack derivation above, is exactly $\frac{1}{n}\sum_i \epsilon_i$.

The Main Algorithm
Similar to the previous cutting plane algorithm: it adds one (joint) constraint in each iteration.
What is different is that only a constant number of constraints is sufficient to find an $\epsilon$-approximate solution. A rough sketch of the loop follows below.
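For comparison with the n-slack loop above, here is a similar sketch (mine, not the paper's exact pseudocode) of a 1-slack style loop under margin rescaling; argmax_oracle and solve_qp are assumed helpers.

```python
def cutting_plane_1_slack(examples, argmax_oracle, solve_qp, eps, max_iter=100):
    """1-slack cutting plane loop (sketch). Each iteration builds one joint
    constraint (y'_1, ..., y'_n) by calling the loss-augmented argmax oracle
    independently on every example, then re-solves the small QP over the
    working set. argmax_oracle(w, x, y) is assumed to return a triple
    (y_hat, delta, margin) with delta = Delta(y, y_hat) and
    margin = w^T [Psi(x, y) - Psi(x, y_hat)]; the regularization constant C
    is assumed to live inside solve_qp."""
    working_set = []
    w, E = solve_qp(working_set)                  # E is the single slack variable
    for _ in range(max_iter):
        triples = [argmax_oracle(w, x, y) for (x, y) in examples]
        n = len(examples)
        delta_avg = sum(d for (_, d, _) in triples) / n
        margin_avg = sum(m for (_, _, m) in triples) / n
        if delta_avg - margin_avg <= E + eps:     # no joint constraint violated by more than eps
            return w
        working_set.append(tuple(y_hat for (y_hat, _, _) in triples))
        w, E = solve_qp(working_set)              # the QP gains only one row per iteration
    return w
```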

Correctness
Correctness is straightforward; when (if) the algorithm terminates:
1. The objective is optimized over a strictly smaller set of constraints, and hence its value is no larger than the true optimum (it is a lower bound).
2. The returned solution is approximately feasible: every constraint is violated by at most $\epsilon$.

Iteration Complexity
Let $\bar{\Delta} = \max_{i, y'} \Delta(y_i, y')$ and $R = \max_{i, y'} \|\Psi(x_i, y_i) - \Psi(x_i, y')\|$. Then the iteration complexity is
$$\frac{16 R^2 C}{\epsilon} + \log_2\!\left(\frac{\bar{\Delta}}{4 R^2 C}\right).$$

Iteration Complexity: Outline
Recall the 1-slack problem:
$$\min_{w,\;E \ge 0} \;\; \frac{1}{2}\|w\|^2 + C E \qquad \text{s.t. } \forall (y'_1, \dots, y'_n) \in \mathcal{Y}^n:\; \frac{1}{n}\, w^T \sum_{i=1}^n [\Psi(x_i, y_i) - \Psi(x_i, y'_i)] \ge \frac{1}{n}\sum_i \Delta(y_i, y'_i) - E$$
The pair $(w, E) = (0, \bar{\Delta})$ is feasible in the above, so the optimal value is at most $C\bar{\Delta}$. By weak duality, the dual program also has optimal value at most $C\bar{\Delta}$.
We show each iteration increases the dual objective by some constant amount, forcing constant iteration complexity.

Classic SVM Dual
Binary classification (primal):
$$\min_{w,\;\epsilon_i \ge 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_i \epsilon_i \qquad \text{s.t. } \forall i \in \{1,\dots,n\}:\; y_i\,(w^T x_i) \ge 1 - \epsilon_i$$
Binary classification (dual):
$$\max_{\alpha} \;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad \text{s.t. } \sum_{i=1}^n \alpha_i y_i = 0, \quad \forall i \in \{1,\dots,n\}:\; 0 \le \alpha_i \le \frac{C}{n}$$
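As a reminder of where such a dual comes from (a standard textbook derivation, not specific to this paper), here is the Lagrangian computation for the soft-margin SVM with a bias term $b$; the bias is an assumption beyond the primal written above, and it is what produces the equality constraint $\sum_i \alpha_i y_i = 0$.

```latex
% Standard soft-margin SVM dual derivation (with bias b), for reference.
\begin{align*}
L(w, b, \epsilon, \alpha, \mu)
  &= \tfrac{1}{2}\|w\|^2 + \tfrac{C}{n}\sum_i \epsilon_i
     - \sum_i \alpha_i \big( y_i (w^T x_i + b) - 1 + \epsilon_i \big)
     - \sum_i \mu_i \epsilon_i, \qquad \alpha_i, \mu_i \ge 0, \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0, \qquad
\frac{\partial L}{\partial \epsilon_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = \tfrac{C}{n}.
\end{align*}
% Substituting these back eliminates w, b, and epsilon and leaves the dual objective
%   \sum_i \alpha_i - (1/2) \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j
% subject to 0 <= alpha_i <= C/n and \sum_i \alpha_i y_i = 0.
```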

The Dual Program
The dual program is easy to compute by forming the Lagrangian and taking derivatives. First we define, for a joint labeling $\bar{y} = (\bar{y}_1, \dots, \bar{y}_n) \in \mathcal{Y}^n$, the averaged loss $\Delta(\bar{y}) = \frac{1}{n}\sum_i \Delta(y_i, \bar{y}_i)$. [The remaining definitions on this slide, including $H_{MR}$, appeared as an equation figure and did not survive transcription.]

Line Search Lemma: Proof
Note that the dual objective is a QP of the form
$$\Theta(\alpha) = h^T \alpha - \frac{1}{2}\alpha^T H \alpha,$$
where $h = \{\Delta(\bar{y})\}_{\bar{y} \in \mathcal{Y}^n}$ and $H_{a,b} = H_{MR}(\bar{y}_a, \bar{y}_b)$, so the quadratic term is $\frac{1}{2}\sum_{a,b} \alpha_a \alpha_b\, H_{MR}(\bar{y}_a, \bar{y}_b)$.

Proof of Line Search Lemma.
Proof.
Direct calculation yields
$$\Theta(\alpha + \beta\eta) - \Theta(\alpha) = \beta\Big[\nabla\Theta(\alpha)^T \eta - \frac{1}{2}\beta\,\eta^T H \eta\Big].$$
Setting the derivative with respect to $\beta$ equal to 0 gives $\beta^* = \frac{\nabla\Theta(\alpha)^T \eta}{\eta^T H \eta}$. Substituting this value of $\beta$ gives
$$\max_\beta\; \Theta(\alpha + \beta\eta) - \Theta(\alpha) = \frac{1}{2}\,\frac{(\nabla\Theta(\alpha)^T \eta)^2}{\eta^T H \eta}.$$
This is the maximum value unless the optimal $\beta$ exceeds $C$, in which case the optimum occurs at $\beta = C$, by concavity in $\beta$. Substituting $\beta = C$ gives the desired result.
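The clipped step size from this lemma is easy to state in code; below is a minimal NumPy sketch (mine, with a made-up matrix H) of the exact line search along a direction, clipped to the feasible range $[0, C]$.

```python
import numpy as np

def clipped_line_search(grad, eta, H, beta_max):
    """Exact line search for a concave quadratic along direction eta:
    maximize  beta * grad@eta - 0.5 * beta**2 * eta@H@eta  over 0 <= beta <= beta_max."""
    g = grad @ eta
    curvature = eta @ H @ eta
    if curvature <= 0:                  # degenerate direction: the objective is linear in beta
        beta = beta_max if g > 0 else 0.0
    else:
        beta = min(max(g / curvature, 0.0), beta_max)   # unconstrained optimum, clipped
    gain = beta * g - 0.5 * beta ** 2 * curvature
    return beta, gain

# Toy usage with a made-up positive semidefinite H and beta_max playing the role of C.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
beta, gain = clipped_line_search(np.array([1.0, 0.0]), np.array([1.0, -0.5]), H, beta_max=1.0)
```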

Iteration Complexity (Sketch)
In step 8 of the algorithm we add a constraint $\hat{y}$ to the working set.

This adds a variable $\alpha_{\hat{y}}$ to the dual program. Note that we can set $\alpha_{\hat{y}} = 0$, so adding it can only increase the dual objective value.

Adding a new constraint $\hat{y}$ to the model adds a new parameter to the dual problem and increases its objective value.
We pick a direction $\eta$ with $\eta_{\hat{y}} = 1$ and $\eta_y = -\frac{1}{C}\alpha_y$ for the existing constraints $y$, so that $\alpha + \beta\eta$ is always dual-feasible (since $\eta^T \mathbf{1} = 0$).
The increase in objective value is lower bounded by the increase from a line search along $\eta$ (since we are only searching a feasible subset of the dual).
The line search lemma (with some algebra) lets us lower bound this increase as a function of $\epsilon$, $C$, and $R$.

Time Complexity Summary
O(n) calls to the separation oracle per iteration.
O(n) computation time per iteration.
Constant number of iterations.
Each QP is of constant size, and hence is solved in constant time.
For a non-linear kernel the per-iteration cost is $O(n^2)$.

Experiments
Applied to binary classification, multi-class classification, linear-chain HMMs, and CFG grammar learning.
Two questions:
1. Does the 1-slack algorithm achieve significant speedups in practice?
2. Is the resulting $w$ as good a solution?

Speed: 1-slack vs. n-slack
The speedup is smaller when the separation oracle is slower, but the gains are still significant.

Speed: 1-slack vs. SVM light on binary classification
The 1-slack method is faster even on binary classification than state-of-the-art SVM training algorithms.

Accuracy vs. n-slack
Accuracy is very similar for all tasks except the HMM task, where n-slack outperforms.
This is because the duality gap $C\epsilon$ is a significant part of the objective value, since the data was almost linearly separable.
Generalization performance for the HMM task was still comparable.

References
Joachims, Finley, Yu (2009). Cutting-Plane Training of Structural SVMs. Machine Learning, 77(1).