CS-E4830 Kernel Methods in Machine Learning

Lecture 5: Multi-class and preference learning
Juho Rousu
11 October 2017

Agenda from now on
- This week's theme: going beyond binary classification
  - Multiclass classification
  - Ranking and preference learning
- Next week: learning with no or partial labels
  - Novelty/anomaly detection
  - Semi-supervised learning
- Following week: period break
- Themes in the second period: dimensionality reduction/component analysis, clustering, kernels for structured and heterogeneous data, structured output

Learning with multiple classes: multiclass classification
- Given a training data set $\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}_{i=1}^{\ell}$
- Outputs belong to a set of possible classes or labels: $y_i \in \mathcal{Y} = \{1, 2, \dots, K\}$
- Three basic strategies to solve the problem:
  - One-versus-all approach: $K$ distinct binary SVMs, i.e. $K$ optimization problems with $\ell$ training examples each
  - All-versus-all approach: $K(K-1)/2$ distinct binary SVMs, i.e. $K(K-1)/2$ optimization problems with $2\ell/K$ training examples each (on average, for balanced classes)
  - Combined model: learning to predict multiple classes directly
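The problem counts above can be checked with a few lines of Python. This is a minimal sketch; the function name `num_binary_problems` is hypothetical, and the all-versus-all training-set size assumes balanced classes.

```python
def num_binary_problems(K: int, l: int) -> dict:
    """Count binary SVMs and training-set sizes for the two decomposition strategies."""
    n_pairs = K * (K - 1) // 2
    return {
        "one_vs_all": {"classifiers": K, "examples_each": l},
        # 2l/K is an average that holds when the classes are balanced (assumption).
        "all_vs_all": {"classifiers": n_pairs, "examples_each": 2 * l / K},
    }

counts = num_binary_problems(K=10, l=1000)
print(counts)
```

With 10 classes and 1000 examples, all-versus-all trains many more classifiers, but each one sees only a fraction of the data.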

One-versus-all multiclass SVM
- Training: train $K$ SVMs, one for each class, to separate the instances of that class from the rest
- The optimisation problem for the $k$th class becomes (weight vector $w^{(k)}$, slack variables $\xi_i^k$, intercept $b$ embedded in a constant feature $x_0$):
$$
\begin{aligned}
\min \quad & \frac{1}{2}\|w^{(k)}\|^2 + C \sum_{i=1}^{\ell} \xi_i^k \\
\text{s.t.} \quad & w^{(k)\top} x_i \ge 1 - \xi_i^k, && \text{if } y_i = k \\
& w^{(k)\top} x_i \le -1 + \xi_i^k, && \text{if } y_i \ne k \\
& \xi_i^k \ge 0
\end{aligned}
$$

One-versus-all multiclass SVM
- Prediction: for an input $x$, the predicted class $\hat{y}(x)$ is the output of the classifier that gives the maximum margin $w^{(k)\top} x$:
$$\hat{y}(x) = \operatorname*{argmax}_{k \in \{1,\dots,K\}} w^{(k)\top} x$$
- Winner-take-all criterion: one class is chosen as the winner ("recipient of the data point")
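The winner-take-all rule is a single argmax over the per-class scores. The sketch below assumes the $K$ weight vectors are already trained and stacked as rows of a matrix `W`; the numbers are made up for illustration.

```python
import numpy as np

def predict_one_vs_all(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """W: (K, d), row k-1 holds w^(k); X: (n, d). Returns classes in {1, ..., K}."""
    scores = X @ W.T                  # scores[i, k-1] = w^(k)^T x_i
    return scores.argmax(axis=1) + 1  # winner-take-all, shifted to 1-based labels

# Illustrative (not trained) weight vectors for K = 3 classes in 2 dimensions.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0]])
y_hat = predict_one_vs_all(W, X)
print(y_hat)
```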

Geometric interpretation
- Each classifier defines a half-space $w^{(k)\top} x\,[+ b_k] > 0$ where it predicts membership in the class (blue lines)
- The regions of the winning class form a partition of the feature space, with boundaries (red lines) satisfying $w^{(k)\top} x\,[+ b_k] = w^{(h)\top} x\,[+ b_h]$ for a pair of most highly scoring classifiers $k, h$
- There may be regions of the feature space where no classifier has a positive score; there, the least negative classifier is the winner
[Figure: hyperplanes H1, H2, H3 with their positive and negative sides, and the class regions C1, C2, C3]

One-versus-all multiclass SVM with kernels
- Training a one-versus-all multiclass SVM is equivalent to training $K$ SVMs with a surrogate class label
$$
\tilde{y}_i^{(k)} = \begin{cases} +1, & \text{if } y_i = k \\ -1, & \text{if } y_i \ne k \end{cases}
$$
- Pre-process the labels of the training data to the surrogate labels prior to training
- The dual SVM can be used just as well: train $K$ SVMs with the surrogate labels $\tilde{y}^{(k)}$, $k = 1, \dots, K$
$$
\begin{aligned}
\max_{\alpha^{(k)}} \quad & \sum_{i=1}^{\ell} \alpha_i^{(k)} - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i^{(k)} \alpha_j^{(k)} \tilde{y}_i^{(k)} \tilde{y}_j^{(k)} \kappa(x_i, x_j) \\
\text{s.t.} \quad & 0 \le \alpha_i^{(k)} \le C, \quad i = 1, \dots, \ell
\end{aligned}
$$
- Above, $\alpha_i^{(k)}$ denotes the dual variables for the $k$th SVM and $\kappa(x_i, x_j)$ is the kernel (same kernel for all classifiers)
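The label pre-processing step can be sketched as below. The helper name `surrogate_labels` is hypothetical; each row of the result could then be passed to any binary SVM solver.

```python
import numpy as np

def surrogate_labels(y: np.ndarray, K: int) -> np.ndarray:
    """Return a (K, l) array; row k-1 holds +1 where y_i == k and -1 elsewhere."""
    classes = np.arange(1, K + 1)[:, None]          # column vector of class ids
    return np.where(y[None, :] == classes, 1, -1)   # broadcast to shape (K, l)

y = np.array([1, 3, 2, 3])
Y_tilde = surrogate_labels(y, K=3)
print(Y_tilde)
```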

All-versus-all multiclass SVM
- The second approach is the so-called all-versus-all multiclass SVM
- Training: build a binary classifier independently for each pair $(k, h)$ of the $K$ classes, where $k \ne h$
- Each classifier is built with a subset of the training data consisting of the instances of the two classes $k$ and $h$:
$$S_{kh} = \{(x_i, y_i) \in S \mid y_i = k \text{ or } y_i = h\}$$
- In total, $K(K-1)/2$ binary classifiers are needed
- Denote the prediction for class pair $(k, h)$ by $\hat{y}_{kh}(x)$

All-versus-all multiclass SVM
- The optimisation problem for class pair $(k, h)$ is given by (weight vector $w^{(k,h)}$, slack variables $\xi_i^{kh}$, intercept $b$ embedded in a constant feature $x_0$):
$$
\begin{aligned}
\min \quad & \frac{1}{2}\|w^{(k,h)}\|^2 + C \sum_{i=1}^{\ell} \xi_i^{kh} \\
\text{s.t.} \quad & w^{(k,h)\top} x_i \ge 1 - \xi_i^{kh}, && \text{if } y_i = k \\
& w^{(k,h)\top} x_i \le -1 + \xi_i^{kh}, && \text{if } y_i = h \\
& \xi_i^{kh} \ge 0
\end{aligned}
$$
- Equivalent to training $K(K-1)/2$ SVMs with surrogate class labels
$$
\tilde{y}_i^{kh} = \begin{cases} +1, & \text{if } y_i = k \\ -1, & \text{if } y_i = h \end{cases}
$$
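Building the training set for one class pair combines the subset selection and the surrogate labels. A minimal sketch, with the hypothetical helper `pairwise_subset` and toy data:

```python
import numpy as np

def pairwise_subset(X: np.ndarray, y: np.ndarray, k: int, h: int):
    """Select S_kh and map its labels to +1 (class k) and -1 (class h)."""
    mask = (y == k) | (y == h)
    return X[mask], np.where(y[mask] == k, 1, -1)

X = np.arange(10.0).reshape(5, 2)   # five toy inputs in two dimensions
y = np.array([1, 2, 3, 1, 3])
X_13, y_13 = pairwise_subset(X, y, k=1, h=3)
print(X_13.shape, y_13)
```

Repeating this for every pair $k < h$ yields the $K(K-1)/2$ binary problems.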

All-versus-all multiclass SVM
- Prediction: for an input $x$, the predicted class $\hat{y}(x)$ is obtained by evaluating $x$ with all classifiers
- Each classifier predicts a class: $\hat{y}_{kh}(x) \in \{k, h\}$
- For each class $k$, count the number of classifiers that predict it:
$$n_k(x) = \sum_{h < k} \mathbf{1}_{\{\hat{y}_{hk}(x) = k\}} + \sum_{k < h} \mathbf{1}_{\{\hat{y}_{kh}(x) = k\}}$$
- The predicted class is the one with the most predictions: $\hat{y}(x) = \operatorname*{argmax}_k n_k(x)$
- This is called the "Max Wins" strategy (taking classifier evaluation as a game with one class winning, the other losing)
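The vote counting can be sketched as follows. The dictionary of pairwise predictions is a made-up input; note that this sketch breaks ties by taking the lowest class index, which is only one possible choice.

```python
import numpy as np
from itertools import combinations

def max_wins(pair_predictions: dict, K: int) -> int:
    """pair_predictions[(k, h)], k < h, is the class (k or h) chosen by that
    pairwise classifier. Returns the class with the most votes."""
    votes = np.zeros(K + 1, dtype=int)        # index 0 is unused
    for k, h in combinations(range(1, K + 1), 2):
        votes[pair_predictions[(k, h)]] += 1  # n_k(x) for the winning class
    return int(votes[1:].argmax()) + 1        # ties: lowest class index (assumption)

preds = {(1, 2): 2, (1, 3): 1, (2, 3): 2}     # hypothetical pairwise outputs
winner = max_wins(preds, K=3)
print(winner)
```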

Geometric interpretation
- A class $k$ is predicted within the region of the feature space where its number of wins equals the maximum: $\{x \mid n_k(x) = \max_h n_h(x)\}$
- Geometrically, the region is defined by an intersection (solid blue lines) of equally many half-spaces
$$H_{kh} = \{x \mid w^{(k,h)\top} x\,[+ b_{kh}] > 0\}, \quad k < h$$
[Figure: pairwise hyperplanes H12, H13, H23 with their positive and negative sides, and the class regions C1, C2, C3]

Geometric interpretation
- Ties can occur when $n_k(x) = n_h(x)$ for some $k \ne h$ (e.g. the triangle in the middle)
- Ties are resolved, e.g., by picking a random class among the tied ones as the prediction
- It is also possible that some classes will not be assigned a region, if the other classes are more frequent everywhere (this does not happen in the picture)
[Figure: pairwise hyperplanes H12, H13, H23 with their positive and negative sides, and the class regions C1, C2, C3]

Combined model (Crammer & Singer, JMLR 2001)
- One-versus-all and all-versus-all models are trained independently for each class or class pair
- Training time scales linearly or quadratically in the number of classes, which may constitute a bottleneck if the number of classes is high
- Independent training does not allow us to directly optimize a regularised loss functional
- Defining a combined model may be more efficient to train and may also give better prediction results
- The following model has been implemented, e.g., within the SVM multiclass package by Thorsten Joachims (http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html)

Combined model
- A combined model can be achieved as follows:
$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2 + C \sum_{i=1}^{\ell} \xi_i \\
\text{s.t.} \quad & w^{(y_i)\top} x_i - w^{(k)\top} x_i \ge 1 - \delta_{y_i,k} - \xi_i, \quad \text{for all } k = 1, \dots, K \text{ and for all } i = 1, \dots, \ell \\
& \xi_i \ge 0, \quad i = 1, \dots, \ell
\end{aligned}
$$
- The model optimizes $K$ weight vectors simultaneously, one for each class
- Prediction is the highest scoring class: $\hat{y}(x) = \operatorname*{argmax}_{k \in \{1,\dots,K\}} w^{(k)\top} x$

Combined model
- The constraints $w^{(y_i)\top} x_i - w^{(k)\top} x_i \ge 1 - \delta_{y_i,k} - \xi_i$ aim to push the score of the correct class $y_i$ above the scores of all other classes, with margin $1 - \delta_{y_i,k}$
- Here
$$
\delta_{y,y'} = \begin{cases} 1, & y = y' \\ 0, & y \ne y' \end{cases}
$$
is the Kronecker delta function

Combined model
- In the combined model, the slack $\xi_i$ can be interpreted as a multiclass hinge loss:
$$\xi_i \ge \max\Big(0,\; 1 - \big(w^{(y_i)\top} x_i - \max_{k \ne y_i} w^{(k)\top} x_i\big)\Big),$$
with equality at the optimum
- Loss is incurred if the highest scoring incorrect class has a score difference (margin) of less than 1 to the score of the correct class
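The multiclass hinge loss for a single example can be computed directly from the class scores. A minimal sketch with made-up weights; `multiclass_hinge` is a hypothetical helper name.

```python
import numpy as np

def multiclass_hinge(W: np.ndarray, x: np.ndarray, y: int) -> float:
    """max(0, 1 - (s_y - max_{k != y} s_k)) with s_k = w^(k)^T x and y in {1..K}."""
    scores = W @ x
    correct = scores[y - 1]
    best_wrong = np.delete(scores, y - 1).max()   # highest scoring incorrect class
    return float(max(0.0, 1.0 - (correct - best_wrong)))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])       # illustrative weights
loss_ok = multiclass_hinge(W, np.array([3.0, 0.5]), y=1)   # margin 2.5, no loss
loss_bad = multiclass_hinge(W, np.array([1.0, 0.5]), y=1)  # margin 0.5, loss 0.5
print(loss_ok, loss_bad)
```

The second example is classified correctly but still incurs loss, because its margin is below 1.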

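Combined model with kernels (Hsu and Lin, 2002)
- A dual form can be derived (details skipped here), with dual variables $\alpha = (\alpha_{i,k})$, $i = 1, \dots, \ell$, $k = 1, \dots, K$:
$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i,k} \alpha_{i,k} (\delta_{y_i,k} - 1) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j=1}^{\ell} \alpha_{i,k} \alpha_{j,k} \kappa(x_i, x_j) \\
\text{s.t.} \quad & \sum_{k} \alpha_{i,k} = 0, \quad i = 1, \dots, \ell \\
& \alpha_{i,k} \le 0, \quad \text{if } y_i \ne k \\
& \alpha_{i,k} \le C, \quad \text{if } y_i = k
\end{aligned}
$$
- The model's prediction in dual form:
$$\hat{y}(x) = \operatorname*{argmax}_{k = 1, \dots, K} \sum_{i=1}^{\ell} \alpha_{i,k} \kappa(x_i, x)$$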
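The dual prediction rule only needs kernel evaluations between the new input and the training inputs. The sketch below assumes a Gaussian kernel and hand-picked dual variables that satisfy the constraints; neither comes from the lecture.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def predict_dual(alpha, X_train, X_new, kernel=rbf_kernel):
    """alpha: (l, K). Returns argmax_k sum_i alpha[i, k] * kappa(x_i, x)."""
    scores = kernel(X_new, X_train) @ alpha   # shape (n_new, K)
    return scores.argmax(axis=1) + 1

X_train = np.array([[0.0, 0.0], [5.0, 5.0]])  # labels y = (1, 2)
# Hypothetical dual variables: each row sums to zero, as the constraint requires.
alpha = np.array([[0.5, -0.5], [-0.5, 0.5]])
y_hat = predict_dual(alpha, X_train, np.array([[0.1, 0.0], [4.9, 5.0]]))
print(y_hat)
```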

Combined model with kernels
- The dual optimisation problem is a QP with equality constraints ($\sum_k \alpha_{i,k} = 0$ for each $i$) and inequality constraints ($\alpha_{i,k} \le 0$ if $y_i \ne k$, $\alpha_{i,k} \le C$ if $y_i = k$)
- It can be solved with algorithms that maintain the equality constraint by updating a pair of dual variables at a time

Preference learning (Hüllermeier & Fürnkranz, 2010)
- Preferences play a key role in various fields of application:
  - Social networks (facebook, google+, ...)
  - Recommender systems (Netflix, last.fm, ...)
  - Review web sites (tripadvisor, goodpubguide, ...)
  - Internet banner advertising
  - Electronic commerce (Amazon, ...)
  - Adaptive retrieval systems (e.g. Google personalized search)

Preference learning
- Goal: learn a predictive preference model from observed preference information
- Notation: "A is preferred over B" is written $A \succ B$; alternatively, we can say A is ranked above B
- Examples:
  - Document classification: given a set of news documents labeled into topic categories, predict which categories might be preferred for new articles
  - Content-based filtering: given a set of queries paired with the documents relevant to each query, learn a model that predicts relevant documents for further queries (search engines)
  - Collaborative filtering: given a set of music pieces with preferences of a set of users, predict the preferences for new pieces and users (last.fm concept)

Representing preferences

Regularized preference learning
- We wish to use the regularised learning framework for learning preferences:
$$\min_{w} \sum_{i=1}^{\ell} L(y_i, w^\top x_i) + \frac{\lambda}{2} \|w\|^2$$
- We need to choose:
  - A feature representation and model structure (how to represent inputs and outputs)
  - A loss function $L$ that measures the discrepancy between true preferences and predicted preferences
  - A regulariser that discourages overfitting
- Put everything together so that the problem can be solved efficiently

Example loss function: Kendall's distance
- Count the pairs that are inverted in the predicted ranking
- Let $r(x)$ denote the ground truth ranking of item $x$, and $r'(x)$ the predicted ranking
- Kendall's distance is given by
$$D_K(r, r') = \big|\{(j, l) \mid r(x_j) > r(x_l) \text{ and } r'(x_j) < r'(x_l)\}\big|$$
- Takes values between $D_K(r, r') = 0$ and $D_K(r, r') = n(n-1)/2$, where $n$ is the number of items
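Kendall's distance is easy to compute by brute force over all item pairs. A minimal sketch, assuming rankings are given as lists of rank positions without ties:

```python
from itertools import combinations

def kendall_distance(r, r_pred) -> int:
    """Number of item pairs ordered differently by the two rankings.
    r[i] and r_pred[i] are the rank positions of item i (no ties assumed)."""
    return sum(
        (r[j] - r[l]) * (r_pred[j] - r_pred[l]) < 0   # True when the pair is inverted
        for j, l in combinations(range(len(r)), 2)
    )

d_same = kendall_distance([1, 2, 3, 4], [1, 2, 3, 4])   # identical rankings
d_rev = kendall_distance([1, 2, 3, 4], [4, 3, 2, 1])    # fully reversed
d_swap = kendall_distance([1, 2, 3, 4], [2, 1, 3, 4])   # one adjacent swap
print(d_same, d_rev, d_swap)
```

The reversed ranking attains the maximum $n(n-1)/2 = 6$ for $n = 4$ items.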

Application 1: Object ranking (Joachims, 2002)

Object ranking through regularised learning
- The aim is to learn a utility function for input objects that agrees with the preferences
- Linear scoring functions are assumed: $f(x) = w^\top x$
- We use a ranking loss function to enforce this agreement
- Training data is a set of objects $S = \{x_i\}_{i=1}^{\ell}$ and a set of preferences between the objects $P = \{x_i \succ x_j \mid x_i, x_j \in S\}$
- In the following, we denote a pairwise preference $(x_i \succ x_j) \in P$ by the shorthand $i \succ j$

RankSVM for object ranking
- RankSVM solves the following regularised learning problem:
$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij} \\
\text{s.t.} \quad & w^\top x_i - w^\top x_j \ge 1 - \xi_{ij}, \quad \text{for all } i \succ j \\
& \xi_{ij} \ge 0, \quad \text{for all } i \succ j
\end{aligned}
$$
- We have only one weight vector, whose norm is regularized (as in SVM)
- The aim is to push the score of the preferred object $x_i$ above that of $x_j$ by a margin, with slack given to each pair
- Corresponds to minimising the average hinge loss of inverted pairs (approximating Kendall's distance):
$$L(P, w) = \frac{1}{|P|} \sum_{\{i \succ j\}} \max(0,\, 1 - w^\top x_i + w^\top x_j)$$
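The primal RankSVM objective can be minimised with plain subgradient descent on the pairwise hinge loss. This is an illustrative sketch only: the solver, step size, iteration count and toy data are all assumptions.

```python
import numpy as np

def ranksvm_objective(w, X, prefs, C):
    """0.5*||w||^2 + (C/|P|) * sum of pairwise hinge losses.
    prefs is a list of (i, j) pairs meaning x_i is preferred over x_j."""
    i, j = np.array(prefs).T
    margins = (X[i] - X[j]) @ w
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).mean()

def ranksvm_subgradient(X, prefs, C=10.0, lr=0.01, n_iter=2000):
    """Subgradient descent on the primal objective (illustrative only)."""
    i, j = np.array(prefs).T
    D = X[i] - X[j]                  # difference vectors x_i - x_j
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        active = D @ w < 1.0         # pairs that violate the margin
        grad = w - C * D[active].sum(axis=0) / len(prefs)
        w -= lr * grad
    return w

X = np.array([[3.0, 1.0], [2.0, 0.0], [1.0, 2.0], [0.0, 1.0]])
prefs = [(0, 1), (1, 2), (2, 3), (0, 3)]
w = ranksvm_subgradient(X, prefs)
scores = X @ w
print(w, scores)
```

Sorting the objects by `scores` gives the predicted ranking.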

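RankSVM with kernels
- To derive the dual problem, denote by $x_{ij} = x_i - x_j$ the difference vector of two feature vectors and rearrange the terms to get the standard form optimization problem:
$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij} \\
\text{s.t.} \quad & 1 - w^\top x_{ij} - \xi_{ij} \le 0, \quad \text{for all } i \succ j \\
& -\xi_{ij} \le 0, \quad \text{for all } i \succ j
\end{aligned}
$$
- The Lagrangian will be
$$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{|P|} \sum_{\{i \succ j\}} \xi_{ij} + \sum_{\{i \succ j\}} \alpha_{ij} \big(1 - w^\top x_{ij} - \xi_{ij}\big) - \sum_{\{i \succ j\}} \beta_{ij} \xi_{ij}$$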

RankSVM with kernels
- Set the derivatives of the Lagrangian $L(w, \xi, \alpha, \beta)$ with respect to the primal variables to zero:
$$
\begin{aligned}
\nabla_w L(w, \xi, \alpha, \beta) &= w - \sum_{\{i \succ j\}} \alpha_{ij} x_{ij} = 0 \\
\frac{\partial}{\partial \xi_{ij}} L(w, \xi, \alpha, \beta) &= \frac{C}{|P|} - \alpha_{ij} - \beta_{ij} = 0
\end{aligned}
$$

RankSVM with kernels
- Plug $w = \sum_{\{i \succ j\}} \alpha_{ij} x_{ij}$ and $\frac{C}{|P|} - \alpha_{ij} - \beta_{ij} = 0$, for all $i \succ j$, into the Lagrangian $L(w, \xi, \alpha, \beta)$
- We obtain the dual function
$$g(\alpha) = \sum_{\{i \succ j\}} \alpha_{ij} - \frac{1}{2} \sum_{\{i \succ j\}, \{r \succ s\}} \alpha_{ij}\, x_{ij}^\top x_{rs}\, \alpha_{rs},$$
which should be optimized with box constraints $0 \le \alpha_{ij} \le \frac{C}{|P|}$
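The substitution step can be written out as follows. This is a worked version of the plug-in above, not an additional result: the $\xi_{ij}$ terms vanish because their coefficient is zero, and the upper bound of the box constraint follows from $\beta_{ij} = \frac{C}{|P|} - \alpha_{ij} \ge 0$.

```latex
\begin{aligned}
L &= \tfrac{1}{2}\|w\|^2
   + \sum_{\{i \succ j\}} \Big(\tfrac{C}{|P|} - \alpha_{ij} - \beta_{ij}\Big)\xi_{ij}
   + \sum_{\{i \succ j\}} \alpha_{ij}
   - w^\top \sum_{\{i \succ j\}} \alpha_{ij} x_{ij} \\
  &= \tfrac{1}{2}\|w\|^2 + \sum_{\{i \succ j\}} \alpha_{ij} - \|w\|^2
   \qquad \text{(using } w = \textstyle\sum_{\{i \succ j\}} \alpha_{ij} x_{ij}\text{)} \\
  &= \sum_{\{i \succ j\}} \alpha_{ij}
   - \tfrac{1}{2} \sum_{\{i \succ j\},\{r \succ s\}} \alpha_{ij}\, x_{ij}^\top x_{rs}\, \alpha_{rs}
   = g(\alpha)
\end{aligned}
```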

RankSVM with kernels
- We get the dual RankSVM problem:
$$
\begin{aligned}
\max_{\alpha} \quad & g(\alpha) = \sum_{\{i \succ j\}} \alpha_{ij} - \frac{1}{2} \sum_{\{i \succ j\}, \{r \succ s\}} \alpha_{ij}\, x_{ij}^\top x_{rs}\, \alpha_{rs} \\
\text{s.t.} \quad & 0 \le \alpha_{ij} \le \frac{C}{|P|}, \quad \text{for all } i \succ j
\end{aligned}
$$
- It is a constrained quadratic programme
- The inner product $x_{ij}^\top x_{rs}$ can be replaced with any kernel $\kappa(x_{ij}, x_{rs})$ acting on the difference vectors $x_{ij} = x_i - x_j$
- The number of dual variables equals the number of pairwise preferences, at worst quadratic in the number of objects
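Because the dual has only box constraints, it can be solved with projected gradient ascent. The sketch below uses the linear kernel on difference vectors; the solver, step size and toy data are assumptions, and the matrix `G` could be replaced by any kernel matrix over the difference vectors.

```python
import numpy as np

def ranksvm_dual(X, prefs, C=10.0, lr=0.05, n_iter=5000):
    """Projected gradient ascent on g(alpha), linear kernel on difference vectors.
    Returns alpha and the primal weight vector w = sum_ij alpha_ij * x_ij."""
    i, j = np.array(prefs).T
    D = X[i] - X[j]                       # rows are the difference vectors x_ij
    G = D @ D.T                           # G[(ij), (rs)] = x_ij^T x_rs
    upper = C / len(prefs)                # box constraint 0 <= alpha_ij <= C/|P|
    alpha = np.zeros(len(prefs))
    for _ in range(n_iter):
        alpha += lr * (1.0 - G @ alpha)   # gradient of g(alpha)
        alpha = np.clip(alpha, 0.0, upper)
    return alpha, D.T @ alpha

X = np.array([[3.0, 1.0], [2.0, 0.0], [1.0, 2.0], [0.0, 1.0]])
prefs = [(0, 1), (1, 2), (2, 3), (0, 3)]
alpha, w = ranksvm_dual(X, prefs)
print(alpha, w)
```

The step size must stay below two divided by the largest eigenvalue of `G` for the iteration to be stable; 0.05 satisfies this for the toy data.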

Application 2: Label ranking
- Training outputs are given as lists of pairwise preferences $A \succ B$ between labels ("label A is preferable to label B"); these define a partial order
- The model ranks all labels: it outputs a total order, that is, all possible labels given in sequential order
- The loss function is between two rankings: loss is incurred if the prediction has $B \succ A$ and the ground truth has $A \succ B$

Label ranking: definitions
- $\mathcal{X}$ is the input space, $\Sigma = \{1, \dots, K\}$ is the set of labels (classes)
- $\mathcal{Y}$ is the output space of all possible partial orders over $\Sigma$
- $S = \{(x_i, Y_i)\}_{i=1}^{\ell}$, $(x_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$, is a set of training examples
- Each $Y \in \mathcal{Y}$ is a set of pairwise preferences $Y \subseteq \Sigma \times \Sigma$
- $(p \succ q) \in Y_i$ denotes that label $p$ is preferable to label $q$ given input $x_i$
- The pairwise preferences can be represented as a preference graph $G(x_i) = (V, E_i)$, where the nodes $V = \Sigma$ correspond to labels, and the edges correspond to the preference relations $Y_i$ of input $x_i$: $(p \succ q) \in Y_i \Leftrightarrow (p, q) \in E_i$

Preference graph examples

From multiclass SVM to label ranking
- The multiclass SVM model is relatively straightforward to convert to a label ranking model
- For an example $(x, Y)$, the scoring function for each label $k = 1, \dots, K$ is given by $w^{(k)\top} x$
- In multiclass classification, we only needed to make the correct class the top-ranked one
- Here we need to order all labels instead of just ranking the correct class to the top: $w^{(p)\top} x > w^{(q)\top} x$ if $(p \succ q) \in Y$
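Turning per-label scores into a total order, and checking it against given preferences, can be sketched as below. The weights and preferences are made up, and both helper names are hypothetical.

```python
import numpy as np

def rank_labels(W: np.ndarray, x: np.ndarray) -> list:
    """Return labels 1..K sorted by decreasing score w^(k)^T x."""
    scores = W @ x
    return (np.argsort(-scores, kind="stable") + 1).tolist()

def violated(ranking: list, Y: list) -> list:
    """Preferences (p, q), meaning p is preferred over q, that the ranking inverts."""
    pos = {label: r for r, label in enumerate(ranking)}
    return [(p, q) for p, q in Y if pos[p] > pos[q]]

W = np.array([[0.5, 0.0], [2.0, 0.0], [0.0, 0.0]])   # illustrative weights, K = 3
ranking = rank_labels(W, np.array([1.0, 0.0]))
bad = violated(ranking, Y=[(1, 2), (1, 3)])
print(ranking, bad)
```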

Label ranking SVM (Gärtner & Vembu, 2009)
- The label ranking SVM is given by
$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{\lambda}{2} \sum_{k=1}^{K} \|w^{(k)}\|^2 + \sum_{i=1}^{\ell} \frac{1}{|Y_i|} \sum_{\{p \succ q\} \in Y_i} \xi_{pqi} \\
\text{s.t.} \quad & w^{(p)\top} x_i - w^{(q)\top} x_i \ge 1 - \xi_{pqi}, \quad \text{for all } \{p \succ q\} \in Y_i,\; i = 1, \dots, \ell \\
& \xi_{pqi} \ge 0
\end{aligned}
$$
- Objective: regularizes the sum of norms of all label classifiers (same as in multiclass learning)
- Slack is given for all examples and all class pairs that have a specified partial order in the ground truth

Label ranking SVM
- Constraints: the constraints $w^{(p)\top} x_i - w^{(q)\top} x_i \ge 1 - \xi_{pqi}$ correspond to minimizing the average of the hinge losses over the label preferences:
$$L(x, Y, w) = \frac{1}{|Y|} \sum_{(p \succ q) \in Y} \max\big(0,\, 1 - w^{(p)\top} x + w^{(q)\top} x\big)$$
- Approximately corresponds to minimising the number of inverted pairs (Kendall's distance)
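The average hinge loss over label preferences for one example can be evaluated as follows. A minimal sketch with made-up weights; `label_ranking_loss` is a hypothetical helper name.

```python
import numpy as np

def label_ranking_loss(W: np.ndarray, x: np.ndarray, Y: list) -> float:
    """Average hinge loss over preferences Y = [(p, q), ...], where label p is
    preferred over label q (labels are 1-based)."""
    scores = W @ x
    return float(np.mean([max(0.0, 1.0 - scores[p - 1] + scores[q - 1])
                          for p, q in Y]))

W = np.array([[2.0, 0.0], [0.5, 0.0], [0.0, 0.0]])   # illustrative weights, K = 3
loss = label_ranking_loss(W, np.array([1.0, 0.0]), Y=[(1, 2), (2, 3), (1, 3)])
print(loss)
```

Only the pair $(2, 3)$ contributes here: its score difference is 0.5, below the required margin of 1.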

Other preference learning setups
- Besides label and object ranking, many other preference learning setups can be tackled
- Calibrated label ranking: labels are both ranked and divided into classes (liked, disliked)
- Instance ranking: object preferences are given on an ordinal scale, e.g. (−−, −, 0, +, ++)
- Pairwise learning setups: collaborative filtering, dyadic learning