Linear Classifiers IV

Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Linear Classifiers IV. Blaine Nelson, Tobias Scheffer

Contents: Classification Problem; Bayesian Classifier, Decision; Linear Classifiers, MAP Models; Logistic Regression; Regularized Empirical Risk Minimization; Kernel Perceptron, Support Vector Machine; Ridge Regression, LASSO; Representer Theorem; Dualized Perceptron, Dual SVM; Mercer Map; Learning with Structured Input & Output: Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm.

Recall: Binary SVM. Classification for two classes: $y(x) = \operatorname{sign} f_\theta(x)$ with $f_\theta(x) = \phi(x)^T \theta$ and a single parameter vector $\theta$. Optimization problem:

$$\min_{\theta,\xi}\ \lambda \sum_{i=1}^{n} \xi_i + \frac{1}{2}\theta^T\theta \quad\text{such that}\quad \xi_i \ge 0 \text{ and } y_i\,\phi(x_i)^T\theta \ge 1 - \xi_i.$$

Does this generalize to $k$ classes?

Recall: Multiclass Logistic Regression. In the multi-class case, the linear model has a decision function $f_\theta(x,y) = \phi(x)^T\theta_y + b_y$ and a classifier $y(x) = \operatorname{argmax}_{z \in Y} f_\theta(x,z)$. Logistic model for multiclass:

$$P(y \mid x, \theta) = \frac{\exp\big(\phi(x)^T\theta_y + b_y\big)}{\sum_{z \in Y} \exp\big(\phi(x)^T\theta_z + b_z\big)}$$

The prior is a normal distribution, $p(\theta) = \mathcal{N}(\theta; 0, \Sigma)$. The maximum-a-posteriori parameter is

$$\theta_{\mathrm{MAP}} = \operatorname{argmin}_\theta\ \underbrace{\sum_{i=1}^{n} \Big[\log \sum_{z \in Y} \exp f_\theta(x_i,z) - f_\theta(x_i,y_i)\Big]}_{\text{loss}} + \underbrace{\frac{\theta^T \Sigma^{-1} \theta}{2}}_{\text{regularizer}}$$
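
The MAP objective above can be evaluated directly. The following minimal sketch (not part of the original slides) computes it with numpy, assuming a weight matrix `Theta` with one row per class, a bias vector `b`, the inverse prior covariance `Sigma_inv` over the flattened weights, and lists of feature vectors and integer labels; all names are illustrative.

```python
import numpy as np

def map_objective(Theta, b, Sigma_inv, features, labels):
    """Sum over examples of [log sum_z exp f(x,z) - f(x,y)] plus theta^T Sigma^-1 theta / 2."""
    theta_flat = Theta.ravel()                       # stack the class weight vectors
    regularizer = 0.5 * theta_flat @ Sigma_inv @ theta_flat
    loss = 0.0
    for phi_x, y in zip(features, labels):
        scores = Theta @ phi_x + b                   # f_theta(x, z) for all classes z
        log_partition = np.logaddexp.reduce(scores)  # log sum_z exp f_theta(x, z)
        loss += log_partition - scores[y]
    return loss + regularizer
```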

Generalizing to Multiclass Problems. Recall how multiclass logistic regression generalized the binary case: $y(x) = \operatorname{sign} f_\theta(x)$ became $y(x) = \operatorname{argmax}_y f_\theta(x,y)$, and the single parameter vector $\theta$ became $\theta = (\theta_1,\dots,\theta_k)^T$ with a component for each class. For an SVM with $k$ classes, the regularizer $\frac{1}{2}\theta^T\theta$ becomes $\frac{1}{2}\sum_{j=1}^{k}\theta_j^T\theta_j$, and the margin constraints $y_i\,\phi(x_i)^T\theta \ge 1 - \xi_i$ become $\forall y \ne y_i:\ f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$.

Multiclass SVM Classification for more than two classes: y x = argmax y f θ x, y, f θ x, y = φ x T θ y A parameter vector for each of the k classes θ = θ 1,, θ k T Optimization problem: min θ,ξ λ such that n i=1 ξ i + 1 2 k j=1 ξ i 0 and y y i f θ x i, y i θ jt θ j f θ x i, y + 1 ξ i [J.Weston, C.Watkins, 1999] 7

Multiclass Feature-Mapping. The different weight vectors can be seen as different slices of a single weight vector $\theta = (\theta_1,\dots,\theta_k)^T$. Joint representation of input and output: $\Phi(x,y)$ consists of $k$ concatenated feature vectors, with $\phi(x)$ placed in the block belonging to class $y$ and zeros everywhere else, e.g. for $y = 2$: $\Phi(x,2) = (0,\ \phi(x),\ 0,\ \dots,\ 0)^T$.

Multiclass Feature-Mapping. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$. Multiclass kernels: define the class encoding $\Lambda(y) = \big(I(y=1),\dots,I(y=k)\big)^T$, so that

$$\Phi(x,y) = \phi(x) \otimes \Lambda(y) = \big(\phi(x)\,I(y=1),\ \dots,\ \phi(x)\,I(y=k)\big)^T.$$

Multiclass Kernel Encoding. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$. For example, with $\phi(x) = (x_1, x_2)^T$ and two classes,

$$\Phi(x,y) = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \otimes \begin{pmatrix} I(y=1) \\ I(y=2) \end{pmatrix} = \begin{pmatrix} x_1\,I(y=1) \\ x_2\,I(y=1) \\ x_1\,I(y=2) \\ x_2\,I(y=2) \end{pmatrix}.$$

The induced multiclass kernel is

$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = I(y_i = y_j)\,k(x_i,x_j),$$

i.e. the input kernel if $y_i = y_j$ and 0 otherwise.
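
This factorization makes the joint kernel trivial to compute without ever forming $\Phi(x,y)$. A minimal illustrative sketch (not from the slides), with a linear base kernel as an example:

```python
import numpy as np

def multiclass_kernel(x_i, y_i, x_j, y_j, base_kernel):
    """k((x_i, y_i), (x_j, y_j)) = I(y_i == y_j) * k(x_i, x_j)."""
    return base_kernel(x_i, x_j) if y_i == y_j else 0.0

# Example with a linear base kernel k(x_i, x_j) = x_i^T x_j:
# multiclass_kernel(np.array([1.0, 2.0]), 1, np.array([0.5, -1.0]), 1, np.dot)
```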

Classification with Information over Classes. Classes can have their own features: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$ and $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$. The kernel becomes

$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\,\Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j),$$

where $\Lambda(y_i)^T\Lambda(y_j)$ measures the correspondence between the classes, e.g. the inner product of the class descriptions with $\Lambda(y_i)$ a term-frequency (TF) vector over all words.

Multiclass SVM. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$; $f$ now has two arguments. A shared feature representation for input and output, $f_\theta(x,y) = \Phi(x,y)^T\theta$, is used in the same way for multiclass, sequence and structured learning, and ranking.

Multiclass SVM Example. $x$ is an encoded input, e.g. a document, and $y = 2$ is the encoded class. With $k = 6$ classes,

$$\Phi(x,2) = \begin{pmatrix} I(y=1)\,x \\ I(y=2)\,x \\ I(y=3)\,x \\ I(y=4)\,x \\ I(y=5)\,x \\ I(y=6)\,x \end{pmatrix} = \begin{pmatrix} 0 \\ x \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$

i.e. $x$ is copied into block 2 and all other blocks are zero.
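
The block construction is straightforward to implement. The following illustrative sketch (not from the slides, and using 0-based class indices) builds the stacked vector $\Phi(x,y)$:

```python
import numpy as np

def joint_feature_map(phi_x, y, num_classes):
    """Phi(x, y): copy phi(x) into block y of a k*d vector, zeros everywhere else."""
    d = phi_x.shape[0]
    Phi = np.zeros(num_classes * d)
    Phi[y * d:(y + 1) * d] = phi_x
    return Phi

# With 0-based classes, the slides' example y = 2 corresponds to y = 1 here:
# joint_feature_map(np.array([0.5, 1.0, 0.0]), 1, 6)
```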

Multiclass SVM. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$. With a parameter vector per class, $f_\theta(x,y) = \phi(x)^T\theta_y$ and $\theta = (\theta_1,\dots,\theta_k)^T$, the optimization problem is

$$\min_{\theta,\xi}\ \lambda \sum_{i=1}^{n} \xi_i + \frac{1}{2}\sum_{j=1}^{k} \theta_j^T\theta_j \quad\text{such that}\quad \xi_i \ge 0 \text{ and } \forall y \ne y_i:\ f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i.$$

Written with the joint feature map, $f_\theta(x,y) = \Phi(x,y)^T\theta$, the same problem becomes

$$\min_{\theta,\xi}\ \lambda \sum_{i=1}^{n} \xi_i + \frac{1}{2}\theta^T\theta \quad\text{such that}\quad \xi_i \ge 0 \text{ and } \forall y \ne y_i:\ f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i.$$

STRUCTURED MULTICLASS OUTPUT

Taxonomic Output Structure. Suppose the $k$ classes are related by an underlying tree structure of depth $d$, e.g. a taxonomy with root Homininae ($v_1^1$), inner nodes Hominini ($v_1^2$) and Gorillini ($v_2^2$), and leaves Pan ($v_1^3$), Homo ($v_2^3$) and Gorilla ($v_3^3$). Each class is a path in the tree, $y = (y^1,\dots,y^d)$. For example: Chimpanzee = (Homininae, Hominini, Pan); Human = (Homininae, Hominini, Homo); Western Gorilla = (Homininae, Gorillini, Gorilla).

Taxonomic Output Structure. Classes in a tree structure (depth $d$): the class for each $x$ is a path in the class tree, $y = (y^1,\dots,y^d)$. How can the common features of input and output be encoded?

Classification with a Taxonomy. Classes in a tree structure (depth $d$): $y(x) = \operatorname{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $y = (y^1,\dots,y^d)$. The class encoding stacks one indicator block per tree level,

$$\Lambda(y) = \begin{pmatrix} \Lambda(y^1) \\ \vdots \\ \Lambda(y^d) \end{pmatrix}, \qquad \Lambda(y^i) = \begin{pmatrix} I(y^i = v_1^i) \\ \vdots \\ I(y^i = v_{n_i}^i) \end{pmatrix},$$

and the joint feature map is

$$\Phi(x,y) = \phi(x) \otimes \Lambda(y) = \phi(x) \otimes \big(I(y^1 = v_1^1),\ \dots,\ I(y^1 = v_{n_1}^1),\ \dots,\ I(y^d = v_1^d),\ \dots,\ I(y^d = v_{n_d}^d)\big)^T.$$

Classification with a Taxonomy Example. $x$ is an encoded input, e.g. a document, and $y = (v_1^1, v_2^2, v_3^3)^T$ is a path, e.g. in a topic tree. Then

$$\Phi(x,y) = \begin{pmatrix} I(y^1 = v_1^1)\,x \\ I(y^2 = v_1^2)\,x \\ I(y^2 = v_2^2)\,x \\ I(y^3 = v_1^3)\,x \\ I(y^3 = v_2^3)\,x \\ I(y^3 = v_3^3)\,x \end{pmatrix} = \begin{pmatrix} x \\ 0 \\ x \\ 0 \\ 0 \\ x \end{pmatrix},$$

i.e. $x$ is copied into the blocks of all nodes on the path and every other block is zero.
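
A small illustrative sketch (not from the slides) of this construction: one block per tree node, and the blocks of the nodes on the path receive a copy of $\phi(x)$. The node names in the comment are the ones from the taxonomy example above.

```python
import numpy as np

def taxonomy_feature_map(phi_x, path, all_nodes):
    """Phi(x, y) for a taxonomy: one block per tree node; every node on the
    path y = (y^1, ..., y^d) receives a copy of phi(x), all other blocks are zero."""
    d = phi_x.shape[0]
    index = {node: i for i, node in enumerate(all_nodes)}
    Phi = np.zeros(len(all_nodes) * d)
    for node in path:
        i = index[node]
        Phi[i * d:(i + 1) * d] = phi_x
    return Phi

# Nodes and path as in the taxonomy example:
# all_nodes = ["Homininae", "Hominini", "Gorillini", "Pan", "Homo", "Gorilla"]
# taxonomy_feature_map(np.array([1.0, 0.0]), ["Homininae", "Hominini", "Homo"], all_nodes)
```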

Classification with a Taxonomy: Kernelization. Classes in a tree structure (depth $d$): $y(x) = \operatorname{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $y = (y^1,\dots,y^d)$. The kernel decomposes over the tree levels:

$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\,\Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\sum_{l=1}^{d} \Lambda(y_i^l)^T\Lambda(y_j^l).$$

Classification with a Taxonomy: Decoding / Prediction. Classes in a tree structure (depth $d$), $f_\theta(x,y) = \Phi(x,y)^T\theta$:

$$y(x) = \operatorname{argmax}_y f_\theta(x,y) = \operatorname{argmax}_y \sum_{i=1}^{n} \alpha_i\,k\big((x_i,y_i),(x,y)\big) = \operatorname{argmax}_y \sum_{i=1}^{n} \alpha_i\,k(x_i,x)\,\Lambda(y_i)^T\Lambda(y) = \operatorname{argmax}_y \sum_{i=1}^{n} \alpha_i\,k(x_i,x)\sum_{l=1}^{d} \Lambda(y_i^l)^T\Lambda(y^l).$$

Structured Classification. The structure of the output space $Y$ is encoded by defining $\Lambda(y)$. The joint feature map $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ captures the information from both the input and the class structure, allowing learning to utilize that structure. This defines a multiclass kernel on the input and output: $k\big((x_i,y_i),(x_j,y_j)\big) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$.

COMPLEX STRUCTURES

Complex Structured Output. Suppose the output space $Y$ contains complex objects. Can they be represented as a combination of binary prediction problems? Examples: part-of-speech and named entity recognition, natural language parsing, sequence alignment.

Complex Output: Tagging. Sequential input / output. Part-of-speech recognition: x = "Curiosity kills the cat", y = (Noun, Verb, Determiner, Noun). Named entity recognition, information extraction: x = "Barbie meets Ken", y = (Person, -, Person).

Complex Output: Parsing. Natural language parsing: x = "Curiosity killed the cat." and y is the parse tree, e.g. S → NP VP, with NP → N ("Curiosity"), VP → V NP ("killed"), NP → Det N ("the cat").

Complex Output: Alignments. Sequence alignment: we are given two sequences, and the prediction is an alignment between them, e.g. x = (s, t) with s = ABJLHBNJYAUGAI and t = BHJKBNYGU, and the predicted alignment AB-JLHBNJYAUGAI / BHJK-BN-YGU.

Complex Structured Output. The output space $Y$ contains complex objects, and a multistage process propagates its errors. Why isn't this just a simple multiclass problem? Even for a single sentence such as "The dog chased the cat", the candidate outputs $y_1, y_2, \dots, y_k$ are complete parse trees, and there are far too many of them to treat each one as a separate class.

Learning with Complex Structured Output. Example: POS tagging. For the sentence x = "Curiosity killed the cat" we need to compute $\operatorname{argmax}_y \Phi(x,y)^T\theta$, which should yield (N, V, Det, N). Explicitly, this would mean comparing $\Phi(x,(N,V,Det,N))^T\theta$ against $\Phi(x,(N,N,N,N))^T\theta$, $\Phi(x,(N,N,N,V))^T\theta$, $\Phi(x,(N,N,V,N))^T\theta$, $\Phi(x,(N,N,V,V))^T\theta$, $\Phi(x,(N,V,N,N))^T\theta$, and so on for every possible tag sequence: too many!

Complex Structured Output. The output space $Y$ contains complex objects, and a multistage process propagates its errors. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.

Complex Structured Output. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y) = \operatorname{argmax}_y \Phi(x,y)^T\theta$. To reduce the number of parameters, one needs an efficient representation of the input and output, $\Phi(x,y)$. This representation depends on the concrete problem definition.

Example: Sequential Input & Output (Feature-Mapping). The input is the token sequence $x = (x_1, x_2, x_3, x_4)$ = "Curiosity killed the cat" with label sequence $y = (y_1, y_2, y_3, y_4)$. There is an attribute for every pair of adjacent labels $y_t$ and $y_{t+1}$, e.g. $\phi_{N,V}(y_t,y_{t+1}) = I(y_t = \text{Noun} \wedge y_{t+1} = \text{Verb})$, and an attribute for every pair of input and output, e.g. $\phi_{\text{cat},N}(x_t,y_t) = I(x_t = \text{cat} \wedge y_t = \text{Noun})$. Label-label counts: $\Phi_i = \sum_t \phi_i(y_t,y_{t+1})$; label-observation counts: $\Phi_j = \sum_t \phi_j(x_t,y_t)$. Joint feature representation: $\Phi(x,y) = \big(\dots,\ \sum_t \phi_{N,V}(y_t,y_{t+1}),\ \dots,\ \sum_t \phi_{\text{cat},N}(x_t,y_t),\ \dots\big)^T$; weight vector $w = (\dots, w_{N,V}, \dots, w_{\text{cat},N}, \dots)^T$.
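
As an illustration (not part of the slides), the sparse count representation of $\Phi(x,y)$ for a given token and tag sequence can be computed by iterating over adjacent label pairs and token-label pairs:

```python
from collections import Counter

def sequence_feature_counts(tokens, tags):
    """Sparse Phi(x, y) for a tag sequence: counts of adjacent label-label pairs
    (e.g. phi_{N,V}) and of token-label pairs (e.g. phi_{cat,N}), keyed by name."""
    counts = Counter()
    for t in range(len(tags) - 1):
        counts[("trans", tags[t], tags[t + 1])] += 1
    for token, tag in zip(tokens, tags):
        counts[("emit", token, tag)] += 1
    return counts

# sequence_feature_counts(["Curiosity", "killed", "the", "cat"], ["N", "V", "Det", "N"])
# -> {("trans","N","V"): 1, ("trans","V","Det"): 1, ..., ("emit","cat","N"): 1, ...}
```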

Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute $y(x) = \operatorname{argmax}_y f_\theta(x,y)$. The argmax goes over all possible sequences, which are exponentially many in the sequence length. $f_\theta(x,y) = \Phi(x,y)^T\theta$ sums over the features of adjacent label-label pairs and the features of $(x_i, y_i)$ pairs, and the summands only differ where the $y$-sequences also differ. Using dynamic programming, the argmax can therefore be computed in linear time (Viterbi algorithm).

Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute $y(x) = \operatorname{argmax}_y f_\theta(x,y)$; using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time $t$ given the maximizing subsequences at time $t-1$. With labels N, V, D, keep a table of the scores $\gamma_{t-1}^N, \gamma_{t-1}^V, \gamma_{t-1}^D$ of the best subsequences ending in each label at $t-1$, and compute, for $x_t = \text{cat}$,

$$\gamma_t^N = \max_z\big(w_{z,N} + \gamma_{t-1}^z\big) + w_{\text{cat},N}, \qquad p_t^N = \operatorname{argmax}_z\big(w_{z,N} + \gamma_{t-1}^z\big),$$

and analogously $\gamma_t^V, p_t^V$ with $w_{z,V}, w_{\text{cat},V}$ and $\gamma_t^D, p_t^D$ with $w_{z,D}, w_{\text{cat},D}$.

Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute $y(x) = \operatorname{argmax}_y f_\theta(x,y)$; using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Once $\gamma_t$ has been computed for the entire sequence, maximize $\gamma_T$ and follow the pointers $p_t$ back to find the maximizing sequence, e.g. (N, V, D, N) for $x$ = "Curiosity killed the cat".
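
A minimal Viterbi sketch (illustrative, not the lecture's code), assuming dictionaries `w_trans[(z, z')]` for label-label weights and `w_emit[(token, z)]` for token-label weights, both defaulting to zero for unseen pairs:

```python
def viterbi(tokens, labels, w_trans, w_emit):
    """Linear-time argmax over label sequences: gamma[z] is the best score of a
    labeling of x_1..x_t that ends in label z; p_t[z] stores the maximizing
    predecessor so the best sequence can be recovered by backtracking."""
    gamma = {z: w_emit.get((tokens[0], z), 0.0) for z in labels}
    backpointers = []
    for token in tokens[1:]:
        new_gamma, p_t = {}, {}
        for z_new in labels:
            best_prev = max(labels, key=lambda z: w_trans.get((z, z_new), 0.0) + gamma[z])
            new_gamma[z_new] = (w_trans.get((best_prev, z_new), 0.0) + gamma[best_prev]
                                + w_emit.get((token, z_new), 0.0))
            p_t[z_new] = best_prev
        gamma = new_gamma
        backpointers.append(p_t)
    best = max(labels, key=lambda z: gamma[z])     # maximize gamma_T
    sequence = [best]
    for p_t in reversed(backpointers):             # follow the pointers p_t back
        sequence.append(p_t[sequence[-1]])
    return list(reversed(sequence))
```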

Structured Output Space: Sequential Input / Output. Decoding: Viterbi algorithm. Feature mapping: entries for adjacent states, $\phi_{N,V}(y_t,y_{t+1}) = I(y_t = \text{Noun} \wedge y_{t+1} = \text{Verb})$, and entries for the observations in a state, $\phi_{\text{cat},N}(x_t,y_t) = I(x_t = \text{cat} \wedge y_t = \text{Noun})$.

Structured Output Space Example: Natural Language Parsing. Here x = "Curiosity killed the cat" and $y$ is its parse tree. Features count the productions used in the tree, e.g. $\phi_{NP \to N}(y) = \sum_{p \in y} I(p = NP \to N)$ for grammar rules and $\phi_{N \to \text{cat}}(x,y) = \sum_{p \in y} I(p = N \to \text{cat})$ for lexical productions; the example vector $\Phi(x,y) = (1, 1, 0, 1, \dots, 0, 1)^T$ lists the counts of productions such as $S \to NP\ VP$, $NP \to N$, $VP \to V$, $VP \to V\ NP$, $N \to \text{ate}$, $N \to \text{cat}$. Joint feature representation: $\Phi(x,y) = (\dots, \phi_{NP \to N}(y), \dots, \phi_{N \to \text{cat}}(x,y), \dots)^T$; weight vector $w = (\dots, w_{NP \to N}, \dots, w_{N \to \text{cat}}, \dots)^T$. Decoding with dynamic programming: CKY parser, $O(n^3)$.

Structured Output Space Example: Collective Classification. The inputs $x_1, \dots, x_4$ with labels $y_1, \dots, y_4$ are connected in a graph. There is an attribute for every pair of adjacent labels, e.g. $\phi_{123}(y_i, y_j)$, and an attribute for every pair of input and output, e.g. $\phi_{\dots}(x_t, y_t) = I(y_t = \text{Institute} \wedge x_t = \dots)$. Decoder: message passing algorithm.

Complex Structured Output. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm. Optimization problem:

$$\min_{\theta,\xi}\ \lambda \sum_{i=1}^{n} \xi_i + \frac{1}{2}\|\theta\|_2^2 \quad\text{such that}\quad \xi_i \ge 0 \text{ and } \forall y \ne y_i:\ f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i.$$

The efficient joint encoding keeps the number of parameters small, but there are still exponentially many constraints.

Learning with Complex Structured Output. The optimization problem above has exponentially many constraints. It is solved by iterative training: constraints for negative pseudo-examples are added whenever an error occurs during training [Tsochantaridis et al., 2004].

Learning with Complex Structured Output: Cutting Plane Algorithm. Given: $L = \{(x_1,y_1),\dots,(x_n,y_n)\}$. Repeat until all sequences are correctly predicted: iterate over all examples $(x_i,y_i)$; compute $\bar{y} = \operatorname{argmax}_{y \ne y_i} \Phi(x_i,y)^T\theta$; if $\Phi(x_i,y_i)^T\theta < \Phi(x_i,\bar{y})^T\theta + 1$ (margin violation), then add the constraint $\Phi(x_i,y_i)^T\theta \ge \Phi(x_i,\bar{y})^T\theta + 1 - \xi_i$ to the working set of constraints and solve the optimization problem for input $x_i$, output $y_i$, and the negative pseudo-examples $\bar{y}$ in the working set. Return the learned $\theta$.
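
A high-level sketch of this working-set loop, under the assumption that the problem-specific components are supplied as functions: `feature_map(x, y)` returns $\Phi(x,y)$ as a numpy vector, `argmax_decoder` performs the decoding over $y \ne y_i$, and `solve_qp` solves the quadratic program over the current working set. These names are placeholders, not an existing API.

```python
def cutting_plane_training(examples, feature_map, argmax_decoder, solve_qp,
                           max_iterations=100):
    """Working-set loop: decode the strongest competitor, add a constraint on a
    margin violation, re-solve the QP over the working set, repeat."""
    working_set = []                      # constraints (Phi(x_i, y_i), Phi(x_i, y_bar))
    theta = solve_qp(working_set)         # initial solution with no constraints
    for _ in range(max_iterations):
        violations = 0
        for x_i, y_i in examples:
            y_bar = argmax_decoder(x_i, exclude=y_i, theta=theta)
            phi_true, phi_bar = feature_map(x_i, y_i), feature_map(x_i, y_bar)
            if phi_true @ theta < phi_bar @ theta + 1:    # margin violation
                working_set.append((phi_true, phi_bar))
                theta = solve_qp(working_set)             # re-solve with the new constraint
                violations += 1
        if violations == 0:               # every example satisfies its margin constraints
            break
    return theta
```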

Structured Output Spaces: General Framework. The framework is customized in terms of the loss function, the joint representation of input and output, and the decoder $\bar{y} = \operatorname{argmax}_{y \ne y_i} \Phi(x_i,y)^T\theta$, optionally with a loss function. Implementation: http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html

Complex Structured Output. The output space $Y$ contains complex objects, and a multistage process propagates its errors. Despite the exponential number of classes, there are only few parameters to estimate (the feature mapping shares parameters across classes), predictions are efficient (thanks to the problem-specific encoding and decoder), and the learning algorithm is efficient (the cutting plane algorithm).

RANKING

Ranking. Can we also learn with other types of structure? In ranking, every prediction is an ordering, and we want to use this ordering information to improve our learning algorithm.

Ranking. Instances should be placed in the correct order, e.g. relevance ranking of search results or relevance ranking of product recommendations. Samples are pairs: $L = \{f(x_i) > f(x_j)\}$, meaning $x_i$ should appear before $x_j$ in the ranking.

Ranking: relevance ranking of search results. Website $x_i$, search query $q$. A joint feature representation $\Phi(x_i, q)$ for websites and search queries: $\Phi(x_i, q)$ can be a vector of features that appear in both $x_i$ and $q$ (i.e., their correspondence), e.g. the count of corresponding words, correspondence in H1 neighborhoods, PageRank, and so on. Samples are pairs: $L = \{f(x_i,q) > f(x_j,q)\}$, meaning $x_i$ should appear before $x_j$ in the ranking for query $q$.

Ranking: relevance ranking of search results. Samples are pairs $L = \{f(x_i,q) > f(x_j,q)\}$, meaning $x_i$ should appear before $x_j$ in the ranking for query $q$. Samples can be taken from click data: a user issues query $q$, receives a list of results, and clicks on the $i$-th element $x_i$ of the list. Implicitly, the user has rejected list elements $1, \dots, i-1$, so for this user and query $q$, $x_i$ should have been placed first.

Ranking: relevance ranking of search results. Samples are pairs $L = \{f(x_i,q) > f(x_j,q)\}$. A user issues query $q$, receives a list of results, and clicks on the $i$-th element $x_i$; for this user and query $q$, $x_i$ should have been placed first. From this encounter we infer the samples $L_q = \{f(x_i,q) > f(x_1,q),\ \dots,\ f(x_i,q) > f(x_{i-1},q)\}$.

Ranking. Given the samples

$$L = \big\{f(x_{1 i_1},q_1) > f(x_{11},q_1),\ \dots,\ f(x_{1 i_1},q_1) > f(x_{1,i_1-1},q_1),\ \dots,\ f(x_{m i_m},q_m) > f(x_{m1},q_m),\ \dots,\ f(x_{m i_m},q_m) > f(x_{m,i_m-1},q_m)\big\},$$

the solution is

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{j}\sum_{i} \xi_{ji} \quad\text{s.t.}\quad \forall j\ \forall i < i_j:\ w^T\big(\Phi(x_{j i_j},q_j) - \Phi(x_{ji},q_j)\big) \ge 1 - \xi_{ji},\ \xi_{ji} \ge 0.$$
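
As an illustrative sketch (not from the slides), the two ingredients of this ranking SVM are easy to write down: deriving pairwise preferences from click data, and evaluating the pairwise hinge objective for a weight vector `w`, given a joint feature function `joint_features(x, q)` (a hypothetical placeholder):

```python
def pairwise_preferences(results, clicked_index, query):
    """From click data: the clicked result should rank above every result shown before it."""
    clicked = results[clicked_index]
    return [(clicked, results[j], query) for j in range(clicked_index)]

def ranking_svm_objective(w, preferences, joint_features, C=1.0):
    """1/2 ||w||^2 + C * sum_p max(0, 1 - w^T (Phi(x_better, q) - Phi(x_worse, q)))."""
    slack = sum(max(0.0, 1.0 - w @ (joint_features(better, q) - joint_features(worse, q)))
                for better, worse, q in preferences)
    return 0.5 * float(w @ w) + C * slack
```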

Structured Classification: Summary. Classification for more than two classes: $y(x) = \operatorname{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$; learning is formulated as a multiclass SVM. The structure of $Y$ is captured by $\Lambda(y)$, giving $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ and $k\big((x_i,y_i),(x_j,y_j)\big) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$. For structured input and output with a high-dimensional $Y$: fewer parameters to estimate thanks to the feature mapping, efficient prediction thanks to the problem-specific encoding, and an efficient learning algorithm via the cutting plane algorithm. Other structured problems, such as ranking, use order constraints.