Adversarial Surrogate Losses for Ordinal Regression
Rizal Fathony, Mohammad Bashiri, Brian D. Ziebart (University of Illinois at Chicago)
Presented by: Gregory P. Spell, August 22, 2018
Outline
1. Introduction: Ordinal Regression
2. Adversarial Ordinal Regression: Zero-Sum Game; Interpretations
3. Experiments
Ordinal Regression
- Appropriate when data have discrete class labels with an inherent ordering
- Misclassifications between nearby labels should be penalized less than dramatically incorrect ones
- Example: product ratings. For a true label "excellent", a predicted label "good" is penalized less than a predicted label "poor"
Loss Function and Challenges
- Let ŷ, y ∈ Y be the predicted and true labels, respectively
- The canonical ordinal regression loss is the absolute error: |ŷ − y|
- The absolute error is non-convex and discontinuous, so surrogate losses are used to approximate it
- Construct the loss matrix L with entries L_{ŷ,y} = |ŷ − y|
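As a concrete illustration (a toy sketch, not code from the paper), the absolute-error loss matrix L can be built directly from the definition:

```python
import numpy as np

# Absolute-error loss matrix for |Y| = 5 ordinal labels:
# L[i, j] = |i - j| (0-indexed labels here for convenience).
n_labels = 5
labels = np.arange(n_labels)
L = np.abs(labels[:, None] - labels[None, :])

print(L)
```

Note the band structure: the penalty grows linearly with the distance between predicted and true label, unlike the 0-1 loss used in ordinary multiclass classification.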
Expected Risk Minimization Formulation
- Probabilistic predictor: P̂(ŷ | x), the conditional distribution of the predicted label ŷ given input x
- Expected loss evaluated on the true distribution P(x, y):
  E_{X,Y ~ P; Ŷ|X ~ P̂}[L_{Ŷ,Y}] = Σ_{x,y,ŷ} P(x, y) P̂(ŷ | x) L_{ŷ,y}   (1)
- Typical objective: construct P̂(ŷ | x) to minimize the expected loss, using the empirical distribution P̃(x, y) from training data in place of P(x, y)
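A toy sketch of equation (1): averaging, over empirical samples, the predictor's label distribution weighted by the loss-matrix column of the true label. All data values below are made up for illustration:

```python
import numpy as np

# Labels 1..5 and the absolute-error loss matrix.
n = 5
L = np.abs(np.subtract.outer(np.arange(1, n + 1), np.arange(1, n + 1)))

# Hypothetical empirical sample: true labels for three inputs
# (each weighted uniformly by the empirical distribution).
y_true = np.array([1, 3, 5])

# Hypothetical predictor: one categorical distribution P_hat(yhat|x) per input.
P_hat = np.array([
    [0.7, 0.2, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.4, 0.2, 0.1],
    [0.0, 0.0, 0.1, 0.3, 0.6],
])

# Expected loss: mean over samples of sum_yhat P_hat(yhat|x) * L[yhat, y].
expected_loss = np.mean([P_hat[i] @ L[:, y - 1] for i, y in enumerate(y_true)])
print(round(expected_loss, 4))  # 0.5667
```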
Threshold Methods
- Define a real-valued ordinal response variable: f̂ := w · x, where w are feature weights learned as parameters of the model
- Introduce thresholds partitioning the real line: θ_0 = −∞ < θ_1 < θ_2 < ... < θ_{|Y|−1} < θ_{|Y|} = ∞
- ŷ is assigned label j when θ_{j−1} < f̂ ≤ θ_j
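A minimal sketch of the bucketing rule with made-up thresholds; `np.searchsorted` with `side="left"` implements exactly the θ_{j−1} < f̂ ≤ θ_j assignment:

```python
import numpy as np

# Hypothetical thresholds theta_1 < theta_2 < theta_3 for |Y| = 4 labels.
thresholds = np.array([-1.0, 0.5, 2.0])

def predict_label(f, thresholds):
    # Returns the label j with theta_{j-1} < f <= theta_j,
    # where theta_0 = -inf and theta_|Y| = +inf implicitly.
    return int(np.searchsorted(thresholds, f, side="left")) + 1

print(predict_label(-2.0, thresholds))  # 1: below theta_1
print(predict_label(0.0, thresholds))   # 2: between theta_1 and theta_2
print(predict_label(3.0, thresholds))   # 4: above theta_3
```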
Zero-Sum Game
- The predictor player chooses a distribution P̂(ŷ | x) to minimize the loss
- The adversarial player chooses a distribution P̌(y̌ | x) to maximize the loss
- P̌(y̌ | x) is constrained to match feature-based statistics of the data
- We then write the game as:
  min_{P̂(ŷ|x)} max_{P̌(y̌|x)} E_{X ~ P̃; Ŷ|X ~ P̂; Y̌|X ~ P̌}[L_{Ŷ,Y̌}]   (2)
  such that E_{X ~ P̃; Y̌|X ~ P̌}[φ(X, Y̌)] = φ̃ := E_{X,Y ~ P̃}[φ(X, Y)]   (3)
  where φ is a vector of feature-based statistics of the data
Feature Representations
- Thresholded regression: φ_th(x, y) = [y·x; I(y ≤ 1); I(y ≤ 2); ...; I(y ≤ |Y|−1)]
  Induces a shared vector of feature weights and a set of thresholds
- Multiclass representation: φ_mc(x, y) = [I(y = 1)·x; I(y = 2)·x; ...; I(y = |Y|)·x]
  Induces a set of class-specific feature weights
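The two representations can be sketched as follows (toy dimensions; `phi_threshold` and `phi_multiclass` are illustrative names, not from the paper):

```python
import numpy as np

def phi_threshold(x, y, n_labels):
    # [y*x ; I(y<=1), ..., I(y<=n_labels-1)]: one shared weight vector
    # plus one threshold parameter per indicator.
    indicators = (y <= np.arange(1, n_labels)).astype(float)
    return np.concatenate([y * x, indicators])

def phi_multiclass(x, y, n_labels):
    # [I(y=1)*x ; ... ; I(y=n_labels)*x]: one weight vector per class.
    blocks = [(1.0 if y == j else 0.0) * x for j in range(1, n_labels + 1)]
    return np.concatenate(blocks)

x = np.array([1.0, -2.0])
print(phi_threshold(x, 2, 4))   # [ 2. -4.  0.  1.  1.]
print(phi_multiclass(x, 2, 4))  # [ 0.  0.  1. -2.  0.  0.  0.  0.]
```

Pairing either φ with a weight vector w yields the per-label potentials f_j = w · φ(x, j) used in the surrogate losses that follow.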
Constrained Cost-Sensitive Minimax¹
- Assemble vector representations of the conditional label distributions: p̂_{x_i} = [P̂(ŷ = 1 | x_i), ..., P̂(ŷ = |Y| | x_i)]^T, and similarly p̌_{x_i}
- Expected loss for a single input x: E_{Ŷ|X ~ P̂; Y̌|X ~ P̌}[L_{Ŷ,Y̌}] = p̂_x^T L p̌_x
- Theorem: the constrained cost-sensitive minimax game reduces to the expectation of many unconstrained minimax games:
  min_{p̂_x} max_{p̌_x} E_{X ~ P̃}[p̂_x^T L p̌_x] = min_w E_{X ~ P̃}[max_{p̌_x} min_{p̂_x} p̂_x^T L'_{x,w} p̌_x]   (4)
  where w parameterizes the new game, which is characterized by the matrix L'_{x,w} with entries
  (L'_{x,w})_{ŷ,y̌} = L_{ŷ,y̌} + w · (φ(x, y̌) − φ(x, y))

¹ From "Adversarial Cost-Sensitive Classification", Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart, 2015
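A toy construction of the augmented game matrix L'_{x,w} (hypothetical features, weights, and data): every column y̌ of L is shifted by w · (φ(x, y̌) − φ(x, y)), so the true-label column keeps the raw loss values:

```python
import numpy as np

n = 4
labels = np.arange(1, n + 1)
L = np.abs(np.subtract.outer(labels, labels)).astype(float)

def phi(x, y):
    # Hypothetical multiclass-style features: x placed in block y.
    out = np.zeros((n, x.size))
    out[y - 1] = x
    return out.ravel()

x, y_true = np.array([0.5, -1.0]), 2
w = np.random.default_rng(0).normal(size=n * x.size)  # toy parameters

# Column-wise adjustment w . (phi(x, ycheck) - phi(x, y_true)).
adjust = np.array([w @ (phi(x, yc) - phi(x, y_true)) for yc in labels])
L_prime = L + adjust[None, :]  # same shift down each column

# The true-label column is unchanged, since its adjustment is zero.
print(np.allclose(L_prime[:, y_true - 1], L[:, y_true - 1]))  # True
```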
Cost-Sensitive Loss Formulation
- Viewing ordinal regression as a cost-sensitive classification problem (using the previous theorem), the authors transform the zero-sum game of equations (2) and (3) to:
  min_w Σ_i max_{p̌_{x_i}} min_{p̂_{x_i}} p̂_{x_i}^T L'_{x_i,w} p̌_{x_i}   (5)
- with game matrix
  L'_{x_i,w} =
  [ f_1 − f_{y_i}            f_2 − f_{y_i} + 1   ...   f_{|Y|} − f_{y_i} + |Y| − 1 ]
  [ f_1 − f_{y_i} + 1        ...                       f_{|Y|} − f_{y_i} + |Y| − 2 ]
  [ ...                      ...                       ...                          ]
  [ f_1 − f_{y_i} + |Y| − 1  ...                       f_{|Y|} − f_{y_i}           ]   (6)
  where f_j = w · φ(x_i, j) is a Lagrangian potential and w is a vector of Lagrangian model parameters
Nash Equilibrium
- Theorem: an adversarial ordinal regression predictor is obtained by choosing parameters w that minimize the empirical risk of the surrogate loss function:
  AL^ord_w(x_i, y_i) = max_{j,l ∈ {1,...,|Y|}} (f_j + f_l + (j − l))/2 − f_{y_i}
                     = max_j (f_j + j)/2 + max_l (f_l − l)/2 − f_{y_i}   (7)
- This loss is derived by finding the Nash equilibrium of the game defined by the matrix L'_{x_i,w}
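A direct sketch of equation (7) with toy potentials; the code computes the decomposed form max_j (f_j + j)/2 + max_l (f_l − l)/2 − f_{y_i}, which factors the pairwise maximization into two independent one-dimensional maximizations:

```python
import numpy as np

def adversarial_ordinal_loss(f, y):
    # f: vector of potentials f_1..f_|Y|; y: true label in 1..|Y|.
    j = np.arange(1, len(f) + 1)
    return np.max((f + j) / 2.0) + np.max((f - j) / 2.0) - f[y - 1]

f = np.array([0.2, 1.0, -0.5, 0.3])  # toy potentials
print(adversarial_ordinal_loss(f, 2))  # 0.75
```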
Thresholded Regression Surrogate Loss
- For the threshold regression feature representation, the parameter vector includes a vector of feature weights w and thresholds θ_k. The adversarial ordinal loss may be written as:
  AL^ord_th(x_i, y_i) = max_j [j(w · x_i + 1) + Σ_{k≥j} θ_k]/2 + max_l [l(w · x_i − 1) + Σ_{k≥l} θ_k]/2 − y_i w · x_i − Σ_{k≥y_i} θ_k   (8)
- This may be interpreted as averaging the label predictions of the potentials w · x + 1 and w · x − 1
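A quick numerical check with toy values: under threshold features the potential is f_j = j(w · x) + Σ_{k≥j} θ_k, and substituting it into the generic surrogate (7) reproduces the thresholded form (8):

```python
import numpy as np

wx = 0.8                            # toy value of w . x_i
theta = np.array([-1.0, 0.5, 2.0])  # toy thresholds, |Y| = 4
n = len(theta) + 1
# tail[j-1] = sum of theta_k for k >= j.
tail = np.array([theta[j - 1:].sum() for j in range(1, n + 1)])
j = np.arange(1, n + 1)
f = j * wx + tail                   # threshold-feature potentials

y = 3
generic = np.max((f + j) / 2) + np.max((f - j) / 2) - f[y - 1]
thresholded = (np.max((j * (wx + 1) + tail) / 2)
               + np.max((j * (wx - 1) + tail) / 2)
               - y * wx - theta[y - 1:].sum())
print(np.isclose(generic, thresholded))  # True
```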
Thresholded Regression Example
- Example for thresholded regression in which the predicted label is 4, and the surrogate loss is obtained by averaging the potentials for class labels 5 and 2
Multiclass Ordinal Surrogate Loss
- In this representation, the parameter vector includes class-specific feature weights w_j, and the adversarial surrogate loss becomes:
  AL^ord_mc(x_i, y_i) = max_{j,l ∈ {1,...,|Y|}} (w_j · x_i + w_l · x_i + j − l)/2 − w_{y_i} · x_i   (9)
- Can be interpreted as a maximization over |Y|(|Y| + 1)/2 linear hyperplanes
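A sketch with hypothetical class-specific weights, also illustrating the hyperplane count: any pair with j < l is dominated by its swapped counterpart, so restricting to j ≥ l (that is, |Y|(|Y| + 1)/2 pairs) leaves the maximum unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
W = rng.normal(size=(n, d))        # toy per-class weight vectors w_j
x, y = rng.normal(size=d), 2       # toy input and true label
f = W @ x                          # potentials f_j = w_j . x

# Maximum over all |Y|^2 label pairs (0-indexed; index gaps match 1-indexed).
full = max((f[j] + f[l] + (j - l)) / 2.0
           for j in range(n) for l in range(n)) - f[y - 1]
# Maximum over the |Y|(|Y|+1)/2 pairs with j >= l only.
upper = max((f[j] + f[l] + (j - l)) / 2.0
            for j in range(n) for l in range(j + 1)) - f[y - 1]
print(np.isclose(full, upper))  # True
```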
Multiclass Surrogate Example Contours
- Contour plots of the loss over the space of potential differences ψ_j := f_j − f_{y_i}, for three classes
Experiments
- Datasets: benchmark ordinal regression datasets from the UCI Machine Learning Repository
- Evaluation: mean absolute error (MAE), compared against methods that use hinge-loss surrogates
- Experiments use both the original feature spaces of the datasets and a kernelized feature space with a Gaussian RBF kernel
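For reference (toy predictions, not results from the paper), the MAE metric is simply the mean absolute difference between predicted and true ordinal labels:

```python
import numpy as np

# Hypothetical predictions on four examples with labels in 1..5.
y_true = np.array([1, 3, 5, 2])
y_pred = np.array([2, 3, 4, 2])
mae = np.mean(np.abs(y_pred - y_true))
print(mae)  # 0.5
```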
Original Feature Space Results
- Bolded values indicate performance that is best, or not significantly worse than the best (under a paired t-test)
- The authors note that threshold-based models perform well on relatively small datasets, while multiclass-based models perform well on large datasets
Gaussian Kernel Feature Results
- Kernelized thresholded regression performs better than the other threshold-based models