CS-E4830 Kernel Methods in Machine Learning
Lecture 5: Multi-class and preference learning
Juho Rousu, 11 October 2017

Agenda from now on:
This week's theme: going beyond binary classification
- Multiclass classification
- Ranking and preference learning
Next week: learning with no or partial labels
- Novelty/anomaly detection
- Semi-supervised learning
Following week: period break
Themes in the second teaching period: dimensionality reduction/component analysis, clustering, kernels for structured and heterogeneous data, structured output

Learning with multiple classes: Multiclass classification

Given a training data set $\{(x_i, y_i)\}_{i=1}^{l} \subset \mathcal{X} \times \mathcal{Y}$, the outputs belong to a set of possible classes or labels: $y_i \in \mathcal{Y} = \{1, 2, \dots, K\}$.

Three basic strategies to solve the problem (a library-level sketch follows below):
- One-versus-all approach: K distinct binary SVMs, i.e. K optimization problems with l training examples each
- All-versus-all approach: K(K-1)/2 distinct binary SVMs, i.e. K(K-1)/2 optimization problems with roughly 2l/K training examples each
- Combined model: learning to predict multiple classes directly
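As an illustration (not from the slides), the three strategies map onto readily available scikit-learn estimators; the synthetic data and parameter values below are placeholders:

```python
# Illustrative sketch, assuming scikit-learn is available; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# One-versus-all: K binary SVMs, each separating one class from the rest.
ova = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)

# All-versus-all: K(K-1)/2 binary SVMs, one per class pair.
ava = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)

# Combined model: the Crammer-Singer formulation optimizes all K weight vectors jointly.
combined = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)

print(ova.predict(X[:5]), ava.predict(X[:5]), combined.predict(X[:5]))
```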

One-versus-all multiclass SVM

Training: train K SVMs, one for each class, to separate the instances of that class from the rest. The optimisation problem for the k-th class becomes (weight vector $w^{(k)}$, slack variables $\xi_i^k$, intercept $b$ embedded in a constant feature $x_0$):

$\min_{w^{(k)}, \xi^k} \ \frac{1}{2}\|w^{(k)}\|^2 + C \sum_{i=1}^{l} \xi_i^k$
s.t. $w^{(k)\top} x_i \geq 1 - \xi_i^k$, if $y_i = k$
     $w^{(k)\top} x_i \leq -1 + \xi_i^k$, if $y_i \neq k$
     $\xi_i^k \geq 0$

One-versus-all multiclass SVM

Prediction: for an input x, the predicted class $\hat{y}(x)$ is the output of the classifier that gives the maximum margin $w^{(k)\top} x$:

$\hat{y}(x) = \mathrm{argmax}_{k \in \{1,\dots,K\}} \ w^{(k)\top} x$

Winner-take-all criterion: one class is chosen as the winner (the "recipient" of the data point).
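A minimal sketch of the winner-take-all rule in NumPy, assuming the K trained weight vectors are stacked into a matrix W (the function and variable names are illustrative, not from the course material):

```python
import numpy as np

def predict_one_vs_all(W, X):
    """Winner-take-all prediction: W has shape (K, d), X has shape (n, d).

    Each row of W is one per-class weight vector w^(k) (with the intercept
    folded into a constant feature, as on the slide); the predicted class is
    the one with the highest score w^(k) . x.
    """
    scores = X @ W.T              # shape (n, K): score of each class for each input
    return np.argmax(scores, axis=1)
```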

Geometric interpretation

Each classifier defines a half-space $w^{(k)\top} x \,[+\, b_k] > 0$ in which it predicts membership in its class (the blue lines in the slide figure). The regions of the winning classes form a partition of the feature space, with boundaries (red lines) satisfying $w^{(k)\top} x \,[+\, b_k] = w^{(h)\top} x \,[+\, b_h]$ for the pair of most highly scoring classifiers k, h.

There may be regions of the feature space where no classifier has a positive score; there, the least negative classifier is the winner.

[Figure: three classes C1, C2, C3 and their one-versus-all hyperplanes H1, H2, H3.]

One-versus-all multiclass SVM with kernels

Training a one-versus-all multiclass SVM is equivalent to training K SVMs with surrogate class labels
$\tilde{y}_i^{(k)} = +1$ if $y_i = k$, and $\tilde{y}_i^{(k)} = -1$ if $y_i \neq k$.

Pre-process the labels of the training data into the surrogate labels prior to training. The dual SVM can be used just as well: train K SVMs with the surrogate labels $\tilde{y}^{(k)}$, $k = 1, \dots, K$:

$\max_{\alpha^{(k)}} \ \sum_{i=1}^{l} \alpha_i^{(k)} - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i^{(k)} \alpha_j^{(k)} \tilde{y}_i^{(k)} \tilde{y}_j^{(k)} \kappa(x_i, x_j)$
s.t. $0 \leq \alpha_i^{(k)} \leq C$, $i = 1, \dots, l$

Above, $\alpha_i^{(k)}$ denotes the dual variables of the k-th SVM and $\kappa(x_i, x_j)$ is the kernel (the same kernel for all classifiers).
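A small sketch of the surrogate-label pre-processing step (illustrative code, not from the slides):

```python
import numpy as np

def surrogate_labels(y, K):
    """Return a (K, l) array of one-versus-all surrogate labels.

    Row k-1 holds y~^(k) for class k in {1, ..., K}: +1 where the original
    label equals k, -1 elsewhere. Each row can then be fed to an ordinary
    binary (dual, kernelised) SVM.
    """
    y = np.asarray(y)
    return np.where(y[None, :] == np.arange(1, K + 1)[:, None], 1, -1)

# Example with K = 3 classes:
print(surrogate_labels([1, 3, 2, 3], K=3))
# [[ 1 -1 -1 -1]
#  [-1 -1  1 -1]
#  [-1  1 -1  1]]
```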

All-versus-all multiclass SVM

The second approach is the so-called all-versus-all multiclass SVM.

Training: build a binary classifier independently for each pair (k, h) of the K classes, where $k \neq h$. Each classifier is built with the subset of training data consisting of the instances of the two classes k and h:
$S_{kh} = \{(x_i, y_i) \in S \mid y_i = k \text{ or } y_i = h\}$

In total K(K-1)/2 binary classifiers are needed. Denote the prediction for class pair (k, h) by $\hat{y}_{kh}(x)$.

All-versus-all multiclass SVM

The optimisation problem for class pair (k, h) is given by (weight vector $w^{(k,h)}$, slack variables $\xi_i^{kh}$, intercept $b$ embedded in a constant feature $x_0$):

$\min_{w^{(k,h)}, \xi^{kh}} \ \frac{1}{2}\|w^{(k,h)}\|^2 + C \sum_{i=1}^{l} \xi_i^{kh}$
s.t. $w^{(k,h)\top} x_i \geq 1 - \xi_i^{kh}$, if $y_i = k$
     $w^{(k,h)\top} x_i \leq -1 + \xi_i^{kh}$, if $y_i = h$
     $\xi_i^{kh} \geq 0$

This is equivalent to training K(K-1)/2 SVMs with surrogate class labels
$\tilde{y}_i^{kh} = +1$ if $y_i = k$, and $\tilde{y}_i^{kh} = -1$ if $y_i = h$.

All-versus-all multiclass SVM

Prediction: for an input x, the predicted class $\hat{y}(x)$ is obtained by evaluating x with all classifiers. Each classifier predicts one of its two classes: $\hat{y}_{kh}(x) \in \{k, h\}$.

For each class k, count the number of classifiers that predict it:
$n_k(x) = \sum_{h < k} 1_{\{\hat{y}_{hk}(x) = k\}} + \sum_{k < h} 1_{\{\hat{y}_{kh}(x) = k\}}$

The predicted class is the one with the most predictions:
$\hat{y}(x) = \mathrm{argmax}_k \ n_k(x)$

This is called the "Max Wins" strategy (taking each classifier evaluation as a game with one class winning, the other losing). A small voting sketch is given below.
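A minimal sketch of Max Wins voting, assuming the pairwise predictions are collected into a dictionary keyed by class pairs (the names and the tie-breaking rule are illustrative):

```python
from collections import Counter

def max_wins(pairwise_predictions):
    """Max Wins voting for all-versus-all prediction.

    pairwise_predictions: dict mapping a class pair (k, h) with k < h to the
    class predicted by that pair's binary classifier (either k or h).
    Returns the class with the most pairwise wins; ties are broken here by the
    smallest class index (the slides suggest e.g. a random tied class).
    """
    votes = Counter()
    for (k, h), winner in pairwise_predictions.items():
        votes[winner] += 1
    return min(votes, key=lambda c: (-votes[c], c))

# Example with K = 3 classes: classifiers for pairs (1,2), (1,3), (2,3)
print(max_wins({(1, 2): 2, (1, 3): 3, (2, 3): 2}))  # -> 2
```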

Geometric interpretation

A class k is predicted within the region of the feature space where its number of wins equals the maximum: $\{x \mid n_k(x) = \max_h n_h(x)\}$.

Geometrically, the region is defined by the intersection (solid blue lines in the slide figure) of equally many half-spaces
$H_{kh} = \{x \mid w^{(k,h)\top} x \,[+\, b_{kh}] > 0\}$, $k < h$.

[Figure: three classes C1, C2, C3 with pairwise decision boundaries H12, H13, H23.]

Geometric interpretation

Ties can occur when $n_k(x) = n_h(x)$ for some $k \neq h$ (e.g. the triangle in the middle of the figure). They can be resolved e.g. by picking a random class among the tied ones as the prediction.

It is also possible that some classes are not assigned any region at all, if the other classes win more often everywhere (this does not happen in the picture).

Combined model (Crammer & Singer, JMLR 2001)

One-versus-all and all-versus-all models are trained independently for each class or class pair. Training time scales linearly or quadratically in the number of classes, which may become a bottleneck when the number of classes is high. Moreover, independent training does not allow us to directly optimize a regularised loss functional.

Defining a combined model may be more efficient to train and may also give better prediction results. The following model has been implemented e.g. in the SVM-multiclass package by Thorsten Joachims (http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html).

Combined model

A combined model can be defined as follows:

$\min_{w, \xi} \ \frac{1}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2 + C \sum_{i=1}^{l} \xi_i$
s.t. $w^{(y_i)\top} x_i - w^{(k)\top} x_i \geq 1 - \delta_{y_i,k} - \xi_i$, for all $k = 1, \dots, K$ and all $i = 1, \dots, l$
     $\xi_i \geq 0$, $i = 1, \dots, l$

The model optimizes K weight vectors simultaneously, one for each class. The prediction is the highest-scoring class:
$\hat{y}(x) = \mathrm{argmax}_{k \in \{1,\dots,K\}} \ w^{(k)\top} x$

Combined model

The constraints of the combined model aim to push the score of the correct class $y_i$ above the scores of all other classes, with margin $1 - \delta_{y_i,k}$, where
$\delta_{y,y'} = \begin{cases} 1, & y = y' \\ 0, & y \neq y' \end{cases}$
is the Kronecker delta function.

Combined model

The slack $\xi_i$ can be interpreted as a multiclass hinge loss: at the optimum,
$\xi_i = \max\left(0,\; 1 - \left(w^{(y_i)\top} x_i - \max_{k \neq y_i} w^{(k)\top} x_i\right)\right)$

Loss is incurred if the highest-scoring incorrect class has a score difference (margin) of less than 1 to the score of the correct class. A small sketch of this loss is given below.
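An illustrative NumPy computation of this multiclass hinge loss (a sketch with assumed variable names; labels are indexed from 0 here):

```python
import numpy as np

def multiclass_hinge_loss(W, X, y):
    """Average Crammer-Singer style multiclass hinge loss.

    W: (K, d) matrix of per-class weight vectors, X: (n, d) inputs,
    y: (n,) integer labels in {0, ..., K-1}.
    """
    scores = X @ W.T                                   # (n, K)
    correct = scores[np.arange(len(y)), y]             # score of the true class
    scores_wrong = scores.copy()
    scores_wrong[np.arange(len(y)), y] = -np.inf       # mask out the true class
    best_wrong = scores_wrong.max(axis=1)              # highest-scoring incorrect class
    return np.maximum(0.0, 1.0 - (correct - best_wrong)).mean()
```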

Combined model with kernels (Hsu and Lin, 2002)

A dual form can be derived (details skipped here), with dual variables $\alpha = (\alpha_{i,k})$, $i = 1, \dots, l$, $k = 1, \dots, K$:

$\max_{\alpha} \ \sum_{i,k} \alpha_{i,k} (\delta_{y_i,k} - 1) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j=1}^{l} \alpha_{i,k} \alpha_{j,k} \kappa(x_i, x_j)$
s.t. $\sum_{k} \alpha_{i,k} = 0$, for all $i$
     $\alpha_{i,k} \leq 0$, if $y_i \neq k$; $\alpha_{i,k} \leq C$, if $y_i = k$

The model's prediction in dual form:
$\hat{y}(x) = \mathrm{argmax}_{k=1,\dots,K} \ \sum_{i=1}^{l} \alpha_{i,k} \kappa(x_i, x)$

Combined model with kernels

The dual optimisation problem above is a QP with both equality and inequality constraints. It can be solved with algorithms that maintain the equality constraints by picking pairs of dual variables at a time.

Preference learning (Hüllermeier & Fürnkranz, 2010)

Preferences play a key role in various fields of application:
- Social networks (Facebook, Google+, ...)
- Recommender systems (Netflix, last.fm, ...)
- Review web sites (TripAdvisor, the Good Pub Guide, ...)
- Internet banner advertising
- Electronic commerce (Amazon, ...)
- Adaptive retrieval systems (e.g. Google personalized search)

Preference learning

Goal: learn a predictive preference model from observed preference information.

Notation: "A is preferred over B" is written $A \succ B$; alternatively we say A is ranked above B.

Examples:
- Document classification: given a set of news documents labeled with topic categories, predict which categories might be preferred for new articles
- Content-based filtering: given a set of queries paired with documents relevant to each query, learn a model that predicts relevant documents for further queries (search engines)
- Collaborative filtering: given a set of music pieces with the preferences of a set of users, predict the preferences for new pieces and users (the last.fm concept)

Representing preferences

Regularized preference learning

We wish to use the regularised learning framework for learning preferences:
$\min_{w} \ \sum_{i=1}^{l} L(y_i, w^\top x_i) + \frac{\lambda}{2} \|w\|^2$

We need to choose:
- A feature representation and model structure (how to represent inputs and outputs)
- A loss function L that measures the discrepancy between true preferences and predicted preferences
- A regulariser that discourages overfitting
- A way to put everything together so that the problem can be solved efficiently

Example loss function: Kendall's distance

Count the pairs that are inverted in the predicted ranking. Let $r(x)$ denote the ground-truth ranking of item x and $r'(x)$ the predicted ranking. Kendall's distance is given by
$D_K(r, r') = \left|\{(j, l) \mid r(x_j) > r(x_l) \text{ and } r'(x_j) < r'(x_l)\}\right|$

It takes values between $D_K(r, r') = 0$ and $D_K(r, r') = n(n-1)/2$, where n is the number of items. A counting sketch follows below.
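A direct, quadratic-time counting sketch of Kendall's distance, assuming the two rankings are given as equal-length lists of rank values (illustrative only):

```python
def kendall_distance(r_true, r_pred):
    """Number of item pairs ordered one way by r_true and the other way by r_pred.

    r_true[i] and r_pred[i] are the ranks of item i under the ground-truth and
    predicted rankings, respectively. Ranges from 0 to n(n-1)/2.
    """
    n = len(r_true)
    return sum(1 for j in range(n) for l in range(n)
               if r_true[j] > r_true[l] and r_pred[j] < r_pred[l])

# Example: three items; the predicted ranking swaps the first two items.
print(kendall_distance([1, 2, 3], [2, 1, 3]))  # -> 1
```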

Object ranking

Application 1: Object ranking (Joachims, 2002)

Object ranking through regularised learning

The aim is to learn a utility function for input objects that agrees with the preferences. Linear scoring functions are assumed: $f(x) = w^\top x$. A ranking loss function is used to enforce this agreement.

The training data is a set of objects $S = \{x_i\}_{i=1}^{l}$ and a set of preferences between the objects $P = \{x_i \succ x_j \mid x_i, x_j \in S\}$.

In the following, a pairwise preference $x_i \succ x_j \in P$ is denoted by the shorthand $i \succ j$.

RankSVM for object ranking

RankSVM solves the following regularised learning problem:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij}$
s.t. $w^\top x_i - w^\top x_j \geq 1 - \xi_{ij}$, for all $i \succ j$
     $\xi_{ij} \geq 0$

There is only one weight vector, whose norm is regularized (as in the SVM). The aim is to push the score of the preferred object $x_i$ above that of $x_j$ by a margin, with a slack given to each pair.

This corresponds to minimising the average hinge loss over inverted pairs (approximating Kendall's distance), sketched in code below:
$L(P, w) = \frac{1}{|P|} \sum_{\{i \succ j\}} \max(0,\, 1 - w^\top x_i + w^\top x_j)$
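An illustrative computation of this pairwise ranking loss L(P, w) (a sketch; the data and names are made up):

```python
import numpy as np

def pairwise_ranking_loss(w, X, preferences):
    """Average hinge loss over preference pairs, L(P, w).

    X: (n, d) array of object features; preferences: list of index pairs (i, j)
    meaning object i is preferred over object j.
    """
    scores = X @ w
    losses = [max(0.0, 1.0 - scores[i] + scores[j]) for i, j in preferences]
    return float(np.mean(losses))

# Example: w scores objects by their first feature; object 0 is preferred over 1.
X = np.array([[2.0, 0.0], [1.0, 0.0]])
print(pairwise_ranking_loss(np.array([1.0, 0.0]), X, [(0, 1)]))  # -> 0.0
```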

RankSVM with kernels

To derive the dual problem, denote by $\Delta x_{ij} = x_i - x_j$ the difference of two feature vectors and rearrange the terms to get the standard-form optimization problem:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij}$
s.t. $1 - w^\top \Delta x_{ij} - \xi_{ij} \leq 0$, for all $i \succ j$
     $-\xi_{ij} \leq 0$, for all $i \succ j$

The Lagrangian is
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij} + \sum_{\{i \succ j\}} \alpha_{ij} \left(1 - w^\top \Delta x_{ij} - \xi_{ij}\right) - \sum_{\{i \succ j\}} \beta_{ij} \xi_{ij}$

RankSVM with kernels

Setting the derivatives of the Lagrangian with respect to the primal variables to zero gives
$\nabla_w L(w, \xi, \alpha, \beta) = w - \sum_{\{i \succ j\}} \alpha_{ij}\, \Delta x_{ij} = 0$
$\partial_{\xi_{ij}} L(w, \xi, \alpha, \beta) = \frac{C}{|P|} - \alpha_{ij} - \beta_{ij} = 0$

RankSVM with kernels

Plugging $w = \sum_{\{i \succ j\}} \alpha_{ij}\, \Delta x_{ij}$ and $\frac{C}{|P|} - \alpha_{ij} - \beta_{ij} = 0$ (for all $i \succ j$) back into the Lagrangian, we obtain the dual function
$g(\alpha) = \sum_{\{i \succ j\}} \alpha_{ij} - \frac{1}{2} \sum_{\{i \succ j\},\{r \succ s\}} \alpha_{ij}\, \Delta x_{ij}^\top \Delta x_{rs}\, \alpha_{rs}$
which should be optimized subject to the box constraints $0 \leq \alpha_{ij} \leq \frac{C}{|P|}$.

RankSVM with kernels

We get the dual RankSVM problem:
$\max_{\alpha} \ g(\alpha) = \sum_{\{i \succ j\}} \alpha_{ij} - \frac{1}{2} \sum_{\{i \succ j\},\{r \succ s\}} \alpha_{ij}\, \Delta x_{ij}^\top \Delta x_{rs}\, \alpha_{rs}$
s.t. $0 \leq \alpha_{ij} \leq \frac{C}{|P|}$, for all $i \succ j$

It is a constrained quadratic programme. The inner product $\Delta x_{ij}^\top \Delta x_{rs}$ can be replaced with any kernel $\kappa(\Delta x_{ij}, \Delta x_{rs})$ acting on the difference vectors $\Delta x_{ij} = x_i - x_j$ (see the sketch below).

The number of dual variables is proportional to the number of pairwise preferences, which is at worst quadratic in the number of objects.
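As an illustration (not from the slides): one common construction takes the differences in the feature space of a base kernel κ, in which case the inner product of two difference vectors expands into four base-kernel evaluations, κ(x_i, x_r) − κ(x_i, x_s) − κ(x_j, x_r) + κ(x_j, x_s), so the values needed by the dual can be read off an ordinary Gram matrix. A minimal sketch under that assumption:

```python
import numpy as np

def difference_kernel(K, pairs):
    """Kernel matrix between difference vectors, from a base Gram matrix K.

    K[a, b] = kappa(x_a, x_b); pairs is a list of preference pairs (i, j).
    Uses <phi(x_i)-phi(x_j), phi(x_r)-phi(x_s)>
       = K[i,r] - K[i,s] - K[j,r] + K[j,s].
    Returns a (|P|, |P|) matrix indexed by the preference pairs.
    """
    P = len(pairs)
    Q = np.empty((P, P))
    for a, (i, j) in enumerate(pairs):
        for b, (r, s) in enumerate(pairs):
            Q[a, b] = K[i, r] - K[i, s] - K[j, r] + K[j, s]
    return Q
```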

Ranking labels

Application 2: Label ranking

The training outputs are given as lists of pairwise preferences A ≻ B between labels, defining a partial order: label A is preferable to label B.

The model ranks all labels: it outputs a total order, that is, all possible labels given in sequential order.

The loss function compares two rankings: a loss is incurred if the prediction has B ≻ A while the ground truth has A ≻ B.

Label ranking: definitions

$\mathcal{X}$ is the input space, $\Sigma = \{1, \dots, K\}$ is the set of labels (classes), and $\mathcal{Y}$ is the output space of all possible partial orders over $\Sigma$.

$S = \{(x_i, Y_i)\}_{i=1}^{l}$, with $(x_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$, is a set of training examples. Each $Y \in \mathcal{Y}$ is a set of pairwise preferences $Y \subseteq \Sigma \times \Sigma$; $(p \succ q) \in Y_i$ denotes that label p is preferable to label q given input $x_i$.

The pairwise preferences can be represented as a preference graph $G(x_i) = (V, E_i)$, where the nodes $V = \Sigma$ correspond to labels and the edges correspond to the preference relations $Y_i$ of input $x_i$: $(p \succ q) \in Y_i \iff (p, q) \in E_i$. A small data-structure sketch is given below.
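A minimal sketch of the preference-graph representation as a plain node/edge set (illustrative names only):

```python
def preference_graph(labels, preferences):
    """Represent the pairwise preferences of one training example as a graph.

    labels: iterable of label ids (the node set Sigma);
    preferences: iterable of pairs (p, q) meaning p is preferred over q.
    Returns (nodes, edges) where edges is a set of directed pairs (p, q).
    """
    nodes = set(labels)
    edges = {(p, q) for p, q in preferences if p in nodes and q in nodes}
    return nodes, edges

# Example: labels {1, 2, 3}; label 1 is preferred over both 2 and 3.
print(preference_graph({1, 2, 3}, [(1, 2), (1, 3)]))
```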

Preference graph examples

From multiclass SVM to label ranking

The multiclass SVM model is relatively straightforward to convert into a label ranking model. For an example (x, Y), the scoring function for each label $k = 1, \dots, K$ is given by $w^{(k)\top} x$.

In multiclass classification, we only needed to make the correct class the top-ranked one. Here we need to order all labels instead of just ranking the correct class at the top:
$w^{(p)\top} x \geq w^{(q)\top} x$ if $(p \succ q) \in Y$

Label ranking SVM (Gärtner & Vembu, 2009)

The label ranking SVM is given by
$\min_{w,\xi} \ \frac{\lambda}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2 + \sum_{i=1}^{l} \frac{1}{|Y_i|} \sum_{\{p \succ q\} \in Y_i} \xi_{pqi}$
s.t. $w^{(p)\top} x_i - w^{(q)\top} x_i > 1 - \xi_{pqi}$, for all $\{p \succ q\} \in Y_i$, $i = 1, \dots, l$
     $\xi_{pqi} \geq 0$

Objective: regularizes the sum of the norms of all label classifiers (the same as in multiclass learning). A slack is given for every example and every label pair that has a specified partial order in the ground truth.

Label ranking SVM

The constraints correspond to minimizing the average of the hinge losses over the label preferences:
$L(x, Y, w) = \frac{1}{|Y|} \sum_{(p \succ q) \in Y} \max\left(0,\; 1 - \left(w^{(p)\top} x - w^{(q)\top} x\right)\right)$

This approximately corresponds to minimising the number of inverted pairs (Kendall's distance). A sketch of this loss follows below.
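An illustrative computation of the label ranking hinge loss for a single example (a sketch with assumed names; labels indexed from 0):

```python
import numpy as np

def label_ranking_hinge_loss(W, x, Y):
    """Average hinge loss over the label preferences Y of one example.

    W: (K, d) matrix of per-label weight vectors, x: (d,) input,
    Y: list of pairs (p, q) meaning label p should be ranked above label q.
    """
    scores = W @ x
    losses = [max(0.0, 1.0 - (scores[p] - scores[q])) for p, q in Y]
    return float(np.mean(losses))

# Example: 3 labels, 2-d input; the preferences say label 0 beats labels 1 and 2.
W = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
print(label_ranking_hinge_loss(W, np.array([1.0, 1.0]), [(0, 1), (0, 2)]))  # -> 0.0
```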

Other preference learning setups

Besides label and object ranking, many other preference learning setups can be tackled:
- Calibrated label ranking: labels are both ranked and divided into classes (liked, disliked)
- Instance ranking: object preferences are given on an ordinal scale, e.g. (−−, −, 0, +, ++)
- Pairwise learning setups: collaborative filtering, dyadic learning