Adversarial Surrogate Losses for Ordinal Regression


Adversarial Surrogate Losses for Ordinal Regression. Rizal Fathony, Mohammad Bashiri, Brian D. Ziebart (University of Illinois at Chicago). August 22, 2018. Presented by: Gregory P. Spell

Outline: 1. Introduction: Ordinal Regression; 2. Adversarial Ordinal Regression: Zero-Sum Game, Interpretations; 3. Experiments

Ordinal Regression. Appropriate when data have discrete class labels with an inherent ordering. Misclassification as a nearby label should be penalized less than a dramatically incorrect misclassification. Example: product ratings. For a true label of "excellent", a predicted label of "good" is penalized less than a predicted label of "poor".

Loss Function and Challenges. Let $\hat{y}, y \in \mathcal{Y}$ be the predicted and true labels, respectively. The canonical ordinal regression loss is the absolute error $|\hat{y} - y|$. Note that the absolute error is non-convex and discontinuous, so surrogate losses are used to approximate it. Construct the loss matrix $L$ with entries $L_{\hat{y},y} = |\hat{y} - y|$.
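
As a concrete illustration (my own sketch, not from the slides), the absolute-error loss matrix $L$ for, say, five ordered labels can be built directly:

```python
import numpy as np

def ordinal_loss_matrix(n_labels):
    """Absolute-error loss matrix with entries L[yhat, y] = |yhat - y| (0-indexed labels)."""
    labels = np.arange(n_labels)
    return np.abs(labels[:, None] - labels[None, :])

L = ordinal_loss_matrix(5)
print(L[0, 4])  # 4: predicting the lowest label when the truth is the highest incurs the largest penalty
```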

Expected Risk Minimization Formulation. Probabilistic predictor: $\hat{P}(\hat{y}\mid x)$, the conditional distribution of the predicted label $\hat{y}$ given input $x$. Expected loss evaluated on the true distribution $P(x,y)$:
$$\mathbb{E}_{X,Y\sim P;\ \hat{Y}\mid X\sim\hat{P}}\big[L_{\hat{Y},Y}\big] = \sum_{x,y,\hat{y}} P(x,y)\,\hat{P}(\hat{y}\mid x)\,L_{\hat{y},y} \qquad (1)$$
Typical objective: construct $\hat{P}(\hat{y}\mid x)$ to minimize the expected loss, using the empirical distribution $\tilde{P}(x,y)$ from the training data in place of $P(x,y)$.

Threshold Methods. Define a real-valued ordinal response variable $\hat{f} := w\cdot x$, where $w$ is a vector of feature weights learned as parameters of the model. Introduce thresholds that partition the real line: $\theta_0 = -\infty < \theta_1 < \theta_2 < \cdots < \theta_{|\mathcal{Y}|-1} < \theta_{|\mathcal{Y}|} = \infty$. Then $\hat{y}$ is assigned label $j$ when $\theta_{j-1} < \hat{f} \le \theta_j$.
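
A minimal sketch (illustrative, not from the slides) of how a threshold method assigns a label once the weights $w$ and the finite thresholds $\theta_1, \ldots, \theta_{|\mathcal{Y}|-1}$ have been learned:

```python
import numpy as np

def threshold_predict(x, w, thetas):
    """Assign the ordinal label j such that theta_{j-1} < w.x <= theta_j.

    `thetas` holds the finite thresholds theta_1 < ... < theta_{|Y|-1};
    theta_0 = -inf and theta_{|Y|} = +inf are implicit. Labels are 1..|Y|.
    """
    f = np.dot(w, x)
    # searchsorted with side='left' counts thresholds strictly below f,
    # which realizes theta_{j-1} < f <= theta_j.
    return int(np.searchsorted(thetas, f, side="left")) + 1

# Example: three thresholds => four ordered labels.
print(threshold_predict(np.array([1.0, 2.0]), np.array([0.5, 0.25]), np.array([-1.0, 0.0, 2.0])))  # -> 3
```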

Zero-Sum Game. The predictor player chooses a distribution $\hat{P}(\hat{y}\mid x)$ to minimize the loss; the adversarial player chooses a distribution $\check{P}(\check{y}\mid x)$ to maximize the loss. $\check{P}(\check{y}\mid x)$ is constrained to match feature-based statistics of the data. We then write the game as:
$$\min_{\hat{P}(\hat{y}\mid x)}\ \max_{\check{P}(\check{y}\mid x)}\ \mathbb{E}_{X\sim P;\ \hat{Y}\mid X\sim\hat{P};\ \check{Y}\mid X\sim\check{P}}\big[\,|\hat{Y}-\check{Y}|\,\big] \qquad (2)$$
such that
$$\mathbb{E}_{X\sim P;\ \check{Y}\mid X\sim\check{P}}\big[\phi(X,\check{Y})\big] = \tilde{\phi} = \mathbb{E}_{X,Y\sim\tilde{P}}\big[\phi(X,Y)\big] \qquad (3)$$
where $\phi$ is a vector of feature-based statistics of the data.

Feature Representations.
Thresholded regression: $\phi_{\text{th}}(x,y) = \big[\,y\,x;\ \mathbb{I}(y\le 1);\ \mathbb{I}(y\le 2);\ \ldots;\ \mathbb{I}(y\le |\mathcal{Y}|-1)\,\big]$. Induces a shared vector of feature weights and a set of thresholds.
Multiclass representation: $\phi_{\text{mc}}(x,y) = \big[\,\mathbb{I}(y=1)\,x;\ \mathbb{I}(y=2)\,x;\ \ldots;\ \mathbb{I}(y=|\mathcal{Y}|)\,x\,\big]$. Induces a set of class-specific feature weights.
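
A small sketch (my own illustration, with 1-indexed labels) of the two feature maps; the exact block ordering is an assumption made for the example:

```python
import numpy as np

def phi_threshold(x, y, n_labels):
    """Thresholded-regression features: [y*x ; 1[y<=1], ..., 1[y<=|Y|-1]]."""
    indicators = (y <= np.arange(1, n_labels)).astype(float)
    return np.concatenate([y * x, indicators])

def phi_multiclass(x, y, n_labels):
    """Multiclass features: one copy of x in the block of class y, zeros elsewhere."""
    blocks = [x if j == y else np.zeros_like(x) for j in range(1, n_labels + 1)]
    return np.concatenate(blocks)

x = np.array([1.0, -2.0])
print(phi_threshold(x, y=3, n_labels=5))   # [ 3. -6.  0.  0.  1.  1.]
print(phi_multiclass(x, y=3, n_labels=5))  # x appears only in the third block
```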

Constrained Cost-Sensitive Minimax [1]. Assemble vector representations of the conditional label distributions: $\hat{p}_{x_i} = [\hat{P}(\hat{y}=1\mid x_i), \ldots, \hat{P}(\hat{y}=|\mathcal{Y}|\mid x_i)]^\top$, and similarly $\check{p}_{x_i}$. The expected loss for a single input $x$ is $\mathbb{E}_{\hat{Y}\mid X\sim\hat{P};\ \check{Y}\mid X\sim\check{P}}\big[L_{\hat{Y},\check{Y}}\big] = \hat{p}_x^\top L\,\check{p}_x$. Theorem: the constrained cost-sensitive minimax game reduces to the expectation of many unconstrained minimax games:
$$\min_{\hat{p}_x}\ \max_{\check{p}_x}\ \mathbb{E}_{X\sim\tilde{P}}\big[\hat{p}_x^\top L\,\check{p}_x\big] \;=\; \min_{w}\ \mathbb{E}_{X\sim\tilde{P}}\Big[\max_{\check{p}_x}\ \min_{\hat{p}_x}\ \hat{p}_x^\top L'_{x,w}\,\check{p}_x\Big] \qquad (4)$$
where $w$ parameterizes the new game, which is characterized by the matrix $L'_{x,w}$ with entries
$$\big(L'_{x,w}\big)_{\hat{y},\check{y}} = L_{\hat{y},\check{y}} + w\cdot\big(\phi(x,\check{y}) - \phi(x,y)\big).$$
[1] From "Adversarial Cost-Sensitive Classification", Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart, 2015.

Cost-Sensitive Loss Formulation. Viewing ordinal regression as a cost-sensitive classification problem (using the previous theorem), the authors transform the zero-sum game of equations (2) and (3) into:
$$\min_{w}\ \sum_i\ \max_{\check{p}_{x_i}}\ \min_{\hat{p}_{x_i}}\ \hat{p}_{x_i}^\top L'_{x_i,w}\,\check{p}_{x_i} \qquad (5)$$
with
$$L'_{x_i,w} = \begin{bmatrix} f_1 - f_{y_i} & \cdots & f_{|\mathcal{Y}|} - f_{y_i} + |\mathcal{Y}| - 1 \\ f_1 - f_{y_i} + 1 & \cdots & f_{|\mathcal{Y}|} - f_{y_i} + |\mathcal{Y}| - 2 \\ \vdots & \ddots & \vdots \\ f_1 - f_{y_i} + |\mathcal{Y}| - 1 & \cdots & f_{|\mathcal{Y}|} - f_{y_i} \end{bmatrix} \qquad (6)$$
where $f_j = w\cdot\phi(x_i, j)$ is a Lagrangian potential and $w$ is a vector of Lagrangian model parameters.
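
A sketch (my own, assuming 1-indexed labels and potentials $f_j = w\cdot\phi(x_i, j)$ already computed) of the augmented game matrix in equation (6), whose entries are $f_{\check{y}} - f_{y_i} + |\hat{y} - \check{y}|$:

```python
import numpy as np

def augmented_game_matrix(f, y_i):
    """L'[jhat, jcheck] = f_{jcheck} - f_{y_i} + |jhat - jcheck| (labels 1..|Y|, arrays 0-indexed)."""
    n = len(f)
    jhat = np.arange(n)[:, None]
    jcheck = np.arange(n)[None, :]
    return f[None, :] - f[y_i - 1] + np.abs(jhat - jcheck)

f = np.array([0.2, 0.7, 0.1, -0.3])   # hypothetical potentials for 4 labels
print(augmented_game_matrix(f, y_i=2))
```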

Nash Equilibrium. Theorem: an adversarial ordinal regression predictor is obtained by choosing parameters $w$ that minimize the empirical risk of the surrogate loss function
$$AL^{\mathrm{ord}}_{w}(x_i, y_i) = \max_{j,l\in\{1,\ldots,|\mathcal{Y}|\}} \frac{f_j + f_l + |j - l|}{2} - f_{y_i} = \max_{j}\frac{f_j + j}{2} + \max_{l}\frac{f_l - l}{2} - f_{y_i}. \qquad (7)$$
This loss is derived by finding the Nash equilibrium of the game defined by the matrix $L'_{x_i,w}$.
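
A quick numerical check (my own sketch) that the pairwise-maximization form and the decomposed form in equation (7) agree:

```python
import numpy as np

def al_ord_pairwise(f, y_i):
    """max over label pairs (j, l) of (f_j + f_l + |j - l|)/2, minus f_{y_i} (labels 1..|Y|)."""
    n = len(f)
    best = max((f[j] + f[l] + abs(j - l)) / 2.0 for j in range(n) for l in range(n))
    return best - f[y_i - 1]

def al_ord_decomposed(f, y_i):
    """Equivalent form: max_j (f_j + j)/2 + max_l (f_l - l)/2 - f_{y_i}."""
    j = np.arange(1, len(f) + 1)
    return np.max((f + j) / 2.0) + np.max((f - j) / 2.0) - f[y_i - 1]

f = np.array([0.2, 0.7, 0.1, -0.3])
print(al_ord_pairwise(f, 2), al_ord_decomposed(f, 2))  # both give 0.75
```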

Thresholded Regression Surrogate Loss. For the thresholded-regression feature representation, the parameter vector consists of a vector of feature weights $w$ and the thresholds $\theta_k$. The adversarial ordinal loss may be written as
$$AL^{\mathrm{ord}}_{\mathrm{th}}(x_i, y_i) = \max_{j}\frac{j\,(w\cdot x_i + 1) + \sum_{k\ge j}\theta_k}{2} + \max_{l}\frac{l\,(w\cdot x_i - 1) + \sum_{k\ge l}\theta_k}{2} - y_i\, w\cdot x_i - \sum_{k\ge y_i}\theta_k. \qquad (8)$$
This may be interpreted as averaging the label predictions for the potentials $w\cdot x + 1$ and $w\cdot x - 1$.
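
A sketch (my own, under the convention that each threshold sum runs over $k \ge j$ up to $|\mathcal{Y}|-1$) evaluating the thresholded surrogate of equation (8):

```python
import numpy as np

def al_ord_threshold(x, y_i, w, thetas):
    """Equation (8): thetas = [theta_1, ..., theta_{|Y|-1}], labels 1..|Y|."""
    n = len(thetas) + 1                     # number of labels |Y|
    wx = float(np.dot(w, x))
    # tail_sum[j-1] = sum of theta_k for k >= j (empty sum for j = |Y|)
    tail_sum = np.append(np.cumsum(thetas[::-1])[::-1], 0.0)
    j = np.arange(1, n + 1)
    term_plus = np.max((j * (wx + 1) + tail_sum) / 2.0)
    term_minus = np.max((j * (wx - 1) + tail_sum) / 2.0)
    return term_plus + term_minus - y_i * wx - tail_sum[y_i - 1]

print(al_ord_threshold(np.array([1.0, 2.0]), 2, np.array([0.5, 0.25]), np.array([-1.0, 0.0, 2.0])))
```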

Thresholded Regression Example. Example for thresholded regression in which the predicted label is 4, and the surrogate loss is obtained using the averaged potentials for class labels 5 and 2.

Multiclass Ordinal Surrogate Loss. In this representation the parameter vector consists of class-specific feature weights $w_j$, and the adversarial surrogate loss becomes
$$AL^{\mathrm{ord}}_{\mathrm{mc}}(x_i, y_i) = \max_{j,l\in\{1,\ldots,|\mathcal{Y}|\}} \frac{w_j\cdot x_i + w_l\cdot x_i + |j - l|}{2} - w_{y_i}\cdot x_i. \qquad (9)$$
This can be interpreted as a maximization over $|\mathcal{Y}|(|\mathcal{Y}|+1)/2$ linear hyperplanes.
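
A corresponding sketch (my own) of the multiclass surrogate in equation (9), reusing the decomposition from equation (7) with potentials $f_j = w_j\cdot x$:

```python
import numpy as np

def al_ord_multiclass(x, y_i, W):
    """Equation (9): W has one row of class-specific weights per label (rows ordered 1..|Y|)."""
    f = W @ x                                # potentials f_j = w_j . x
    j = np.arange(1, len(f) + 1)
    # max over pairs (j, l) of (f_j + f_l + |j - l|)/2 decomposes into two separate maxima
    return np.max((f + j) / 2.0) + np.max((f - j) / 2.0) - f[y_i - 1]

W = np.array([[0.1, -0.2], [0.3, 0.0], [0.0, 0.4]])   # hypothetical weights for 3 labels
print(al_ord_multiclass(np.array([1.0, 2.0]), y_i=2, W=W))  # -> 0.95
```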

Multiclass Surrogate Example Contours. Contour plots of the loss over the space of potential differences $\psi_j := f_j - f_{y_i}$, with three classes.

Experiments. Datasets: benchmark ordinal regression datasets from the UCI Machine Learning Repository. Evaluation uses the mean absolute error (MAE), with comparisons to methods that use hinge-loss surrogates. Experiments are performed using the original feature spaces of the datasets, as well as a kernelized feature space using a Gaussian RBF kernel.

Original Feature Space Results. Bolded values indicate the best performance, or performance not significantly worse than the best (under a paired t-test). The authors note that threshold-based models perform well for relatively small datasets, while multiclass-based models perform well for large datasets.

Gaussian Kernel Features Results. Kernelized thresholded regression performs better than the other threshold-based models.