PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification

Emilie Morvant (emilie.morvant@lif.univ-mrs.fr), Sokol Koço (sokol.koco@lif.univ-mrs.fr), Liva Ralaivola (liva.ralaivola@lif.univ-mrs.fr)
Aix-Marseille Univ., QARMA, LIF, CNRS, UMR 7279, F-13013, Marseille, France

(Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).)

Abstract

In this paper, we propose a PAC-Bayes bound for the generalization risk of the Gibbs classifier in the multi-class classification framework. The novelty of our work is the critical use of the confusion matrix of a classifier as an error measure; this puts our contribution in the line of work aiming at dealing with performance measures richer than a mere scalar criterion such as the misclassification rate. Thanks to very recent results on matrix concentration inequalities, we derive two bounds showing that the true confusion risk of the Gibbs classifier is upper-bounded by its empirical risk plus a term depending on the number of training examples in each class. To the best of our knowledge, these are the first PAC-Bayes bounds based on confusion matrices.

1. Introduction

The PAC-Bayesian framework, first introduced in (McAllester, 1999a), is an important field of research in learning theory. It borrows ideas from the philosophy of Bayesian inference and mixes them with techniques used in statistical approaches to learning. Given a family of classifiers F, the ingredients of a PAC-Bayesian bound are a prior distribution P over F, a learning sample S and a posterior distribution Q over F. The distribution P conveys some prior belief on what the best classifiers from F are, prior to any access to S; the classifiers expected to be the most performant for the classification task at hand therefore have the largest weights under P. The posterior distribution Q is learned/adjusted using the information provided by the training set S. The essence of PAC-Bayesian results is to bound the risk of the stochastic Gibbs classifier associated with Q (Catoni, 2004). When specialized to appropriate function spaces F and relevant families of prior and posterior distributions, PAC-Bayes bounds can be used to characterize the error of different existing classification methods, such as support vector machines (Langford & Shawe-Taylor, 2002; Langford, 2005). PAC-Bayes bounds can also be used to derive new supervised learning algorithms. For example, Lacasse et al. (2007) have introduced an elegant bound on the risk of the majority vote, which holds for any space F. This bound is used to derive an algorithm, MinCq (Laviolette et al., 2011), which achieves empirical results on par with state-of-the-art methods. Some other important results are given in (Catoni, 2007; Seeger, 2002; McAllester, 1999b; Langford et al., 2001).

In this paper, we address the multi-class classification problem. Given the ability of PAC-Bayesian bounds to explain the performance of learning methods, contributions related to our work include multi-class formulations of SVMs, such as those of Weston & Watkins (1998), Lee et al. (2004) and Crammer & Singer (2001). As majority vote methods, we may relate our work to boosting-based multi-class extensions of AdaBoost (Freund & Schapire, 1996), such as the framework of Mukherjee & Schapire (2011) and the algorithms AdaBoost.MH/AdaBoost.MR (Schapire & Singer, 1999) and SAMME (Zhu et al., 2009). The originality of our work is that we consider the confusion matrix of the Gibbs classifier as an error measure. We believe that in the multi-class framework it is more relevant to consider the confusion matrix as the error measure than the mere misclassification error, which corresponds to the probability for some classifier h to err in its prediction on x. The information as to what is the probability for an instance of class p to be classified into class q, with p ≠ q, by some predictor is indeed crucial in some applications (think of the difference between false-negative and false-positive predictions in an automated diagnosis system).

To the best of our knowledge, we are the first to propose a generalization bound on the confusion matrix in the PAC-Bayesian framework. The result we propose heavily relies on a concentration inequality for sums of random matrices due to Tropp (2012).

The rest of this paper is organized as follows. Sec. 2 introduces the setting of multi-class learning and some of the basic notation used throughout the paper. Sec. 3 briefly recalls the folk PAC-Bayes bound as introduced in (McAllester, 2003). In Sec. 4, we present the main contribution of this paper, our PAC-Bayes bound on the confusion matrix, followed by its proof in Sec. 5. We discuss some future work in Sec. 6.

2. Setting and Notation

This section presents the general setting that we consider and the different tools that we make use of.

2.1. General Problem Setting

We consider classification tasks over some input space X. The output space is Y = {1, ..., Q}, where Q is the number of classes. The learning sample is denoted by S = {(x_i, y_i)}_{i=1}^m, where each example is drawn i.i.d. from a fixed but unknown probability distribution D over X x Y; D^m denotes the distribution of an m-sample. We consider a family F ⊆ Y^X of classifiers, and P and Q will respectively refer to prior and posterior distributions over F. Given the prior distribution P and the training set S, learning aims at finding a posterior Q leading to good generalization. The Kullback-Leibler divergence KL(Q||P) between Q and P is

  KL(Q||P) = E_{f~Q} ln[ Q(f) / P(f) ].   (1)

The function sign(x) equals +1 if x >= 0 and -1 otherwise; the indicator function I(x) is equal to 1 if x is true and 0 otherwise.

2.2. Conventions and Basics on Matrices

Throughout the paper we consider only real-valued square matrices C of order Q (the number of classes); ᵗC is the transpose of the matrix C, Id_Q denotes the identity matrix of size Q, and 0 is the zero matrix. The results given in this paper are based on a concentration inequality of Tropp (2012) for sums of random self-adjoint matrices, which extends to general real-valued matrices thanks to the dilation technique (Paulsen, 2002): the dilation dil(C) of a matrix C is

  dil(C) = [ 0   C ]
           [ ᵗC  0 ].   (2)

The notation ||.|| refers to the operator (or spectral) norm. It returns the maximum singular value of its argument:

  ||C|| = max{ λ_max(C), -λ_min(C) },   (3)

where λ_max and λ_min are respectively the algebraic maximum and minimum singular values of C. Note that the dilation preserves spectral information, so that

  λ_max(dil(C)) = ||dil(C)|| = ||C||.   (4)

3. The Usual PAC-Bayes Theorem

Here, we recall the main PAC-Bayesian bound in the binary classification case, as presented in (McAllester, 2003; Seeger, 2002; Langford, 2005), where the set of labels is Y = {-1, +1}. The true risk R(f) and the empirical error R_S(f) of a classifier f are defined as:

  R(f) = E_{(x,y)~D} I(f(x) ≠ y),   R_S(f) = (1/m) Σ_{i=1}^m I(f(x_i) ≠ y_i).

The learner's aim is to choose a posterior distribution Q on F such that the risk of the Q-weighted majority vote B_Q (also called the Bayes classifier) is as small as possible. B_Q makes a prediction according to

  B_Q(x) = sign( E_{f~Q} f(x) ).

The true risk R(B_Q) and the empirical error R_S(B_Q) of the Bayes classifier are defined as the probability that it commits an error on an example:

  R(B_Q) = P_{(x,y)~D}( B_Q(x) ≠ y ).   (5)

However, the PAC-Bayes approach does not directly bound the risk of B_Q. Instead, it bounds the risk of the stochastic Gibbs classifier G_Q, which predicts the label of x ∈ X by first drawing f according to Q and then returning f(x). The true risk R(G_Q) and the empirical error R_S(G_Q) of G_Q are therefore:

  R(G_Q) = E_{f~Q} R(f);   R_S(G_Q) = E_{f~Q} R_S(f).   (6)

Note that in this setting we have R(B_Q) <= 2 R(G_Q). We now present the PAC-Bayes theorem, which gives a bound on the error of the stochastic Gibbs classifier.
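The following snippet is a minimal NumPy sketch (ours, not from the paper) of the two classifiers just defined: the Gibbs classifier averages the individual errors under Q, while the majority vote B_Q aggregates the predictions before measuring its error. The toy data, the family of random linear classifiers and the Dirichlet posterior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data with labels in {-1, +1}.
m, d = 200, 5
X = rng.normal(size=(m, d))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))

# A small family F of linear classifiers f(x) = sign(<w, x>) and a posterior Q over F.
W = rng.normal(size=(10, d))
Q = rng.dirichlet(np.ones(len(W)))

preds = np.sign(X @ W.T)                       # shape (m, |F|), entries in {-1, +1}

# Empirical Gibbs risk (Eq. 6): Q-weighted average of the individual empirical errors.
R_S_gibbs = float((preds != y[:, None]).mean(axis=0) @ Q)

# Empirical risk of the Q-weighted majority vote B_Q (Eq. 5).
vote = np.sign(preds @ Q)
R_S_bayes = float((vote != y).mean())

print(R_S_gibbs, R_S_bayes, R_S_bayes <= 2 * R_S_gibbs)   # illustrates the factor-2 relation
```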

Theorem 1. For any D, any F, any P of support F, and any δ ∈ (0, 1], we have

  P_{S~D^m}( ∀Q on F:  kl( R_S(G_Q), R(G_Q) ) <= (1/m) [ KL(Q||P) + ln( ξ(m)/δ ) ] ) >= 1 - δ,

where kl(a, b) = a ln(a/b) + (1 - a) ln((1 - a)/(1 - b)), and ξ(m) = Σ_{i=0}^m C(m, i) (i/m)^i (1 - i/m)^{m-i}.

We now provide a novel PAC-Bayes bound in the context of multi-class classification by considering the confusion matrix as the error measure.

4. Multi-Class PAC-Bayes Bound

4.1. Definitions and Setting

As said earlier, we focus on multi-class classification. The output space is Y = {1, ..., Q}, with Q > 2. We only consider learning algorithms acting on a learning sample S = {(x_i, y_i)}_{i=1}^m, where each example is drawn i.i.d. according to D, such that m >= Q and m_{y_j} >= 1 for every class y_j ∈ Y, where m_{y_j} is the number of examples of true class y_j.

In the context of multi-class classification, a natural performance tool is the confusion matrix. We consider the classical definition of the confusion matrix based on conditional probabilities, which is desirable in order to minimize the effects of unbalanced classes. Concretely, for a given classifier f ∈ F and a sample S = {(x_i, y_i)}_{i=1}^m ~ D^m, the empirical confusion matrix D̂_f^S = (d̂_{pq})_{1<=p,q<=Q} of f is defined as follows:

  ∀p, q:  d̂_{pq} = (1/m_p) Σ_{i=1}^m I( f(x_i) = q ) I( y_i = p ).

The true confusion matrix D_f = (d_{pq})_{1<=p,q<=Q} of f over D corresponds to:

  ∀p, q:  d_{pq} = E_{x | y=p} I( f(x) = q ) = P_{(x,y)~D}( f(x) = q | y = p ).

If f correctly classifies every example of the sample S, then all the elements of the confusion matrix are 0 except the diagonal ones, which correspond to the correctly classified examples. Hence the more non-zero elements a confusion matrix has outside its diagonal, the more the classifier is prone to err. Recall that in a learning process the objective is to learn a classifier f ∈ F with a low true error, i.e., with good generalization guarantees; we are thus only interested in the errors of f. Our objective is then to find an f leading to a confusion matrix with as many zero elements outside the diagonal as possible. Since the diagonal gives the conditional probabilities of correct predictions, we propose to consider a different kind of confusion matrix obtained by discarding the diagonal values. Then the only non-zero elements of the new confusion matrix correspond to the examples that are misclassified by f. For all f ∈ F we define the empirical and true confusion matrices of f by Ĉ_f^S = (ĉ_{pq})_{1<=p,q<=Q} and C_f = (c_{pq})_{1<=p,q<=Q} respectively, such that for all p, q:

  ĉ_{pq} = 0 if q = p, and ĉ_{pq} = d̂_{pq} otherwise;   (7)
  c_{pq} = 0 if q = p, and c_{pq} = d_{pq} = P_{(x,y)~D}( f(x) = q | y = p ) otherwise.   (8)

Note that if f correctly classifies every example of a given sample S, then the empirical confusion matrix Ĉ_f^S is equal to 0. Similarly, if f is a perfect classifier over the distribution D, then the true confusion matrix is equal to 0. Therefore a relevant task is to minimize the size of the confusion matrix, i.e., to have a confusion matrix as close to 0 as possible.

4.2. Main Result: Confusion PAC-Bayes Bound for the Gibbs Classifier

Our main result is a PAC-Bayes generalization bound for the Gibbs classifier G_Q in this particular context, where the empirical and true error measures are respectively given by the confusion matrices from (7) and (8). In this case, we define the true and empirical confusion matrices of G_Q respectively by:

  C_{G_Q} = E_{f~Q} E_{S~D^m} Ĉ_f^S;   Ĉ_{G_Q} = E_{f~Q} Ĉ_f^S.

Given f ~ Q and a sample S ~ D^m, our objective is to bound the difference between C_{G_Q} and Ĉ_{G_Q}, the true and empirical errors of the Gibbs classifier. Remark that the error rate P(f(x) ≠ y) of a classifier f may be directly computed as the 1-norm of ᵗC_f p, where p is the vector of prior class probabilities. However, in our case, concentration inequalities are only available for the operator norm. Since ||u||_1 <= sqrt(Q) ||u||_2 for any Q-dimensional vector u, we have P(f(x) ≠ y) <= sqrt(Q) ||C_f||. Thus minimizing the operator norm of C_f is a relevant strategy to control the risk. Here is our main result, a bound on the operator norm of the difference between C_{G_Q} and Ĉ_{G_Q}.
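As an illustration (ours, not from the paper), the following NumPy sketch builds the off-diagonal, conditionally normalized empirical confusion matrix of Eq. (7) and evaluates the operator norm used throughout; the function names and the tiny label vectors are hypothetical.

```python
import numpy as np

def empirical_confusion(y_true, y_pred, Q):
    """Off-diagonal conditional confusion matrix of Eq. (7): entry (p, q), p != q,
    is the fraction of class-p examples that are predicted as class q."""
    C = np.zeros((Q, Q))
    for p in range(Q):
        mask = (y_true == p)
        if mask.sum() == 0:
            continue
        for q in range(Q):
            if q != p:
                C[p, q] = np.mean(y_pred[mask] == q)
    return C

def operator_norm(C):
    # Spectral norm: the largest singular value of C.
    return np.linalg.norm(C, ord=2)

# Tiny usage example with Q = 3 classes (labels 0..Q-1 instead of 1..Q).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 2, 0, 1, 0, 2, 2, 1])
C = empirical_confusion(y_true, y_pred, Q=3)
print(C)
print(operator_norm(C))   # a small norm means few conditional errors
```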

Theorem 2. Let X ⊆ R^d be the input space, Y = {1, ..., Q} the output space, and D a distribution over X x Y, with D^m the distribution of an m-sample, and let F be a family of classifiers from X to Y. Then for every prior distribution P over F and any δ ∈ (0, 1], we have:

  P_{S~D^m}( ∀Q on F:  || C_{G_Q} - Ĉ_{G_Q} || <= sqrt( (8Q / (m̃ - 8Q)) [ KL(Q||P) + ln( m̃ / (4δ) ) ] ) ) >= 1 - δ,

where m̃ = min_{y ∈ {1,...,Q}} m_y is the minimal number of examples from S which belong to the same class.

Proof. Deferred to Section 5.

Note that, for every y ∈ Y, we need the hypothesis m_y > 8Q (so that m̃ - 8Q > 0), which is not too strong a limitation. Finally, we rewrite Theorem 2 in order to provide a bound on the size ||C_{G_Q}||.

Corollary 1. Under the hypotheses of Theorem 2, we have:

  P_{S~D^m}( ∀Q on F:  || C_{G_Q} || <= || Ĉ_{G_Q} || + sqrt( (8Q / (m̃ - 8Q)) [ KL(Q||P) + ln( m̃ / (4δ) ) ] ) ) >= 1 - δ.

Proof. By application of the reverse triangle inequality ||A|| - ||B|| <= ||A - B|| to Theorem 2.

Both Theorem 2 and Corollary 1 yield a bound on the estimation, through the operator norm, of the true confusion matrix of the Gibbs classifier over the posterior distribution Q, though this is more explicit in the corollary. Let the number of classes Q be a constant; then the true risk is upper-bounded by the empirical risk of the Gibbs classifier plus a term depending on the number of training examples, especially on the value m̃, which corresponds to the minimal number of examples that belong to the same class. This means that the larger m̃, the closer the empirical confusion matrix of the Gibbs classifier is to its true matrix.

4.3. Upper Bound on the Risk of the Majority Vote Classifier

We recall that the Bayes classifier B_Q is also known as the majority vote classifier under a given posterior distribution Q. In the multi-class setting, B_Q returns, for any example, the majority class under the measure Q, and we define it as:

  B_Q(x) = argmax_{c ∈ Y} E_{f~Q} I( f(x) = c ).   (9)

We define the conditional Gibbs risk R(G_Q, p, q) and Bayes risk R(B_Q, p, q) as

  R(G_Q, p, q) = E_{x~D | y=p} E_{f~Q} I( f(x) = q ),   (10)
  R(B_Q, p, q) = E_{x~D | y=p} I( argmax_{c ∈ Y} g(c, x) = q ),   (11)

where g(c, x) = E_{f~Q} I( f(x) = c ). The former is the (p, q) entry of C_{G_Q} if p ≠ q, and the latter is the (p, q) entry of C_{B_Q}.

Proposition 1. Let Q be the number of classes. Then R(B_Q, p, q) and R(G_Q, p, q) are related by the following inequality:

  ∀q, p:  R(B_Q, p, q) <= Q R(G_Q, p, q).   (12)

Proof. Given in (Morvant et al., 2012).

This proposition implies the following result on the confusion matrices associated with B_Q and G_Q.

Corollary 2. Let Q be the number of classes. Then C_{B_Q} and C_{G_Q} are related by the following inequality:

  || C_{B_Q} || <= Q || C_{G_Q} ||.   (13)

Proof. Given in (Morvant et al., 2012).
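To give a feel for how the complexity term of Theorem 2 behaves (under the reconstruction of the bound stated above), here is a small NumPy sketch; the class counts, the KL value and the confidence level are illustrative assumptions, not values from the paper.

```python
import numpy as np

def confusion_bound_complexity(m_per_class, KL, delta=0.05):
    """Complexity term of Theorem 2 as stated above:
    sqrt( 8Q/(m_tilde - 8Q) * (KL + ln(m_tilde/(4*delta))) ),
    valid when every class has more than 8Q examples."""
    Q = len(m_per_class)
    m_tilde = min(m_per_class)          # minimal number of examples in a class
    assert m_tilde > 8 * Q, "the bound requires m_y > 8Q for every class"
    return np.sqrt(8 * Q / (m_tilde - 8 * Q) * (KL + np.log(m_tilde / (4 * delta))))

# The term shrinks as the smallest class grows, roughly as O(1/sqrt(m_tilde)).
for m_tilde in (500, 5000, 50000):
    print(m_tilde, confusion_bound_complexity([m_tilde, 2 * m_tilde, 2 * m_tilde], KL=5.0))
```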

5. Proof of Theorem 2

This section gives the formal proof of Theorem 2. We first introduce a concentration inequality for a sum of random square matrices. This allows us to deduce the PAC-Bayes generalization bound for confusion matrices by following the same three-step process as the one given in (McAllester, 2003; Seeger, 2002; Langford, 2005) for the classical PAC-Bayesian bound.

5.1. Concentration Inequality for the Confusion Matrix

The main result of our work is based on the following corollary of a concentration inequality for sums of random self-adjoint matrices due to Tropp (2012) (see Theorem 3 in the Appendix); this theorem generalizes Hoeffding's inequality to the case of self-adjoint random matrices. The purpose of the corollary is to restate Theorem 3 so that it carries over to matrices that are not self-adjoint. It is central for us to have such a result, as the matrices we are dealing with, namely confusion matrices, are rarely symmetric.

Corollary 3. Consider a finite sequence {M_i} of independent, random, square matrices of order Q, and let {a_i} be a sequence of fixed scalars. Assume that each random matrix satisfies E M_i = 0 and ||M_i|| <= a_i almost surely. Then, with σ² = Σ_i a_i², we have, for all ε >= 0,

  P( || Σ_i M_i || >= ε ) <= 2Q exp( -ε² / (8σ²) ).   (14)

Proof. We want to verify the hypotheses of Theorem 3 in order to apply it. Let {M_i} be a finite sequence of independent, random, square matrices of order Q such that E M_i = 0, and let {a_i} be a sequence of fixed scalars such that ||M_i|| <= a_i. We consider the sequence {dil(M_i)} of random self-adjoint matrices of dimension 2Q. By the definition of the dilation, we obtain E dil(M_i) = 0. From Equation (4), the dilation preserves the spectral information. Thus, on the one hand, we have

  || Σ_i M_i || = λ_max( dil( Σ_i M_i ) ) = λ_max( Σ_i dil(M_i) ).

On the other hand, we have

  || dil(M_i) || = || M_i ||,  so  λ_max( dil(M_i) ) <= a_i.

To satisfy the hypothesis dil(M_i) ⪯ A_i, we need a suitable sequence of fixed self-adjoint matrices {A_i} of dimension 2Q, where ⪯ refers to the semidefinite order on self-adjoint matrices. It suffices to take the diagonal matrix λ_max(dil(M_i)) Id_{2Q}, which ensures dil(M_i) ⪯ λ_max(dil(M_i)) Id_{2Q}. More precisely, since for every i we have λ_max(dil(M_i)) <= a_i, we fix A_i as the diagonal matrix with a_i on the diagonal, i.e., A_i = a_i Id_{2Q}, so that || Σ_i A_i² || = Σ_i a_i² = σ². Finally, we invoke Theorem 3 to obtain the concentration inequality (14).

In order to make use of this corollary, we rewrite the confusion matrices as sums of example-based confusion matrices. That is, for each example (x_i, y_i), we define its empirical confusion matrix C_f^i = (ĉ_{pq}^{(i)})_{1<=p,q<=Q} as follows:

  ∀p, q:  ĉ_{pq}^{(i)} = 0 if q = p, and ĉ_{pq}^{(i)} = I( f(x_i) = q ) I( y_i = p ) / m_{y_i} otherwise,

where m_{y_i} is the number of examples of class y_i ∈ Y belonging to S. Given an example (x_i, y_i), the example-based confusion matrix contains at most one non-zero element, which occurs when f misclassifies (x_i, y_i); when f correctly classifies (x_i, y_i), the example-based confusion matrix is equal to 0. Our error measure is then Ĉ_f^S = Σ_{i=1}^m C_f^i, that is, we penalize f only when it errs.

We further introduce the random square matrices C̄_f^i:

  C̄_f^i = C_f^i - E_{S~D^m} C_f^i,   (15)

which satisfy E C̄_f^i = 0. We have yet to find a suitable a_i for a given C̄_f^i. Let λ_max,i be the maximum singular value of C̄_f^i; it is easy to verify that λ_max,i <= 1/m_{y_i}. Thus, for all i, we fix a_i = 1/m_{y_i}. Finally, with the introduced notation, Corollary 3 leads to the following concentration inequality:

  P( || Σ_{i=1}^m C̄_f^i || >= ε ) <= 2Q exp( -ε² / (8σ²) ).   (16)

This inequality allows us to prove Theorem 2 by following the process of (McAllester, 2003; Seeger, 2002; Langford, 2005).
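The following NumPy check (ours, not from the paper) illustrates the dilation trick of Eq. (2) used in the proof of Corollary 3: the dilation of a square matrix is self-adjoint, and its largest eigenvalue equals the spectral norm of the original matrix (Eq. 4).

```python
import numpy as np

def dilation(C):
    """Self-adjoint dilation of Eq. (2): [[0, C], [C^T, 0]]."""
    Q = C.shape[0]
    Z = np.zeros((Q, Q))
    return np.block([[Z, C], [C.T, Z]])

rng = np.random.default_rng(1)
C = rng.normal(size=(4, 4))          # a (generally non-symmetric) square matrix
S = dilation(C)

# S is self-adjoint, and its largest eigenvalue equals the spectral norm of C (Eq. 4).
assert np.allclose(S, S.T)
print(np.linalg.norm(C, ord=2), np.max(np.linalg.eigvalsh(S)))
```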

5.2. Three-Step Proof of Our Bound

First, thanks to the concentration inequality (16), we prove the following lemma.

Lemma 1. Let the matrices C̄_f^i = C_f^i - E_{S~D^m} C_f^i of order Q be defined as in (15), and write C̄_f = Σ_{i=1}^m C̄_f^i. Then the following bound holds for any δ ∈ (0, 1]:

  P_{S~D^m}( E_{f~P} exp( ((1 - 8σ²) / (8σ²)) ||C̄_f||² ) <= 2Q / (8σ²δ) ) >= 1 - δ.

Proof. If Z is a real-valued random variable such that P(Z >= z) <= k exp(-n g(z)), with g non-negative and non-decreasing and k a constant, then P( exp((n - 1) g(Z)) >= ν ) <= min( 1, k ν^{-n/(n-1)} ). We apply this to the concentration inequality (16). Choosing g(z) = z² (non-negative and non-decreasing for z >= 0), z = ε, n = 1/(8σ²) and k = 2Q, we obtain:

  P( exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) >= ν ) <= min( 1, 2Q ν^{-1/(1-8σ²)} ).

Since exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) is always non-negative, we can compute its expectation as:

  E[ exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) ] = ∫_0^∞ P( exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) >= ν ) dν
    <= 1 + ∫_1^∞ 2Q ν^{-1/(1-8σ²)} dν
    = 1 + 2Q (1 - 8σ²)/(8σ²)
    <= 2Q / (8σ²).

For a given classifier f ∈ F, we thus have:

  E_{S~D^m}[ exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) ] <= 2Q / (8σ²).   (17)

Then, if P is a probability distribution over F, Equation (17) implies that:

  E_{S~D^m} E_{f~P}[ exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ) ] <= 2Q / (8σ²).   (18)

Using Markov's inequality (see Theorem 4 in the Appendix), we obtain the result of the lemma.

The second step of the proof of Theorem 2 uses the change of measure given in (McAllester, 2003). We recall this result in the following lemma.

Lemma 2 (Donsker-Varadhan inequality, Donsker & Varadhan (1975)). Given the Kullback-Leibler divergence KL(Q||P) between two distributions P and Q (defined in Equation (1)) and a function g, we have:

  E_{b~Q} g(b) <= KL(Q||P) + ln E_{b~P} exp( g(b) ).

Proof. See (McAllester, 2003).

With g(f) = ((1 - 8σ²)/(8σ²)) ||C̄_f||², where C̄_f = Σ_{i=1}^m C̄_f^i, Lemma 2 implies:

  ((1 - 8σ²)/(8σ²)) E_{f~Q} ||C̄_f||² <= KL(Q||P) + ln E_{f~P} exp( ((1 - 8σ²)/(8σ²)) ||C̄_f||² ).   (19)

The last step that completes the proof of Theorem 2 consists in applying the result obtained in Lemma 1 to Equation (19). Then, with probability at least 1 - δ over S, we have:

  ((1 - 8σ²)/(8σ²)) E_{f~Q} ||C̄_f||² <= KL(Q||P) + ln( 2Q / (8σ²δ) ).   (20)

Since the square function is clearly convex, we apply Jensen's inequality (see Theorem 5 in the Appendix) to (20). Then, with probability at least 1 - δ over S, and for every distribution Q on F, we have:

  E_{f~Q} ||C̄_f|| <= sqrt( (8σ² / (1 - 8σ²)) [ KL(Q||P) + ln( 2Q / (8σ²δ) ) ] ).   (21)

Since C̄_f = Σ_{i=1}^m ( C_f^i - E_{S~D^m} C_f^i ), this bound is already quite similar to the one given in Theorem 2. We present in the next section the calculations leading to our PAC-Bayesian generalization bound.

5.3. Simplification

We first compute the variance parameter σ² = Σ_{i=1}^m a_i². In Section 5.1 we showed that for each i ∈ {1, ..., m} we can choose a_i = 1/m_{y_i}, where y_i is the class of the i-th example and m_{y_i} is the number of examples of class y_i. Thus we have:

  σ² = Σ_{i=1}^m (1/m_{y_i})² = Σ_{y=1}^Q Σ_{i: y_i = y} (1/m_y)² = Σ_{y=1}^Q 1/m_y.

For the sake of simplifying Equation (21), and since the term on its right-hand side is an increasing function of σ², we upper-bound σ²:

  σ² = Σ_{y=1}^Q 1/m_y <= Q / min_{y ∈ {1,...,Q}} m_y.   (22)

Let m̃ = min_{y ∈ {1,...,Q}} m_y. Then, using (22) in Equation (21), we obtain:

  E_{f~Q} ||C̄_f|| <= sqrt( (8Q / (m̃ - 8Q)) [ KL(Q||P) + ln( m̃ / (4δ) ) ] ).   (23)
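As a quick sanity check of the variance simplification above (ours, with illustrative class counts), the per-example and per-class computations of σ² coincide and stay below Q/m̃:

```python
import numpy as np

# sigma^2 = sum_i (1/m_{y_i})^2 = sum_y 1/m_y <= Q / min_y m_y (Sec. 5.3).
m_per_class = np.array([120, 300, 80, 500])        # hypothetical m_y for Q = 4 classes
Q = len(m_per_class)

# Each class y contributes m_y terms equal to (1/m_y)^2.
sigma2_per_example = np.sum(m_per_class * (1.0 / m_per_class) ** 2)
sigma2_per_class = np.sum(1.0 / m_per_class)
upper = Q / m_per_class.min()

print(sigma2_per_example, sigma2_per_class, upper)
assert np.isclose(sigma2_per_example, sigma2_per_class) and sigma2_per_class <= upper
```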

It remains to replace C̄_f = Σ_{i=1}^m ( C_f^i - E_{S~D^m} C_f^i ). Recall that C_{G_Q} = E_{f~Q} E_{S~D^m} Ĉ_f^S and Ĉ_{G_Q} = E_{f~Q} Ĉ_f^S; we obtain:

  E_{f~Q} C̄_f = E_{f~Q} Σ_{i=1}^m ( C_f^i - E_{S~D^m} C_f^i )
             = E_{f~Q} ( Ĉ_f^S - E_{S~D^m} Ĉ_f^S )
             = E_{f~Q} Ĉ_f^S - E_{f~Q} E_{S~D^m} Ĉ_f^S
             = Ĉ_{G_Q} - C_{G_Q}.   (24)

By substituting the left-hand side of inequality (23) with the term (24) — using || E_{f~Q} C̄_f || <= E_{f~Q} ||C̄_f||, which follows from the convexity of the norm — we find the bound of our Theorem 2.

6. Discussion and Future Work

This work gives rise to many interesting questions, among which the following ones. Some future work will focus on instantiating our bound of Theorem 2 for specific multi-class frameworks, such as multi-class SVMs and multi-class boosting. Taking advantage of our theorem while using the confusion matrices may allow us to derive new generalization bounds for these methods. Additionally, we are interested in seeing how effective learning methods may be derived from the risk bound we propose. For instance, in the binary PAC-Bayes setting, the algorithm MinCq proposed by Laviolette et al. (2011) minimizes a bound depending on the first two moments of the margin of the Q-weighted majority vote. From our Theorem 2, and with a similar study, we would like to design a new multi-class algorithm and observe how sound such an algorithm could be. This would probably require the derivation of a Cantelli-Tchebycheff deviation inequality in the matrix case. Besides, it might be very interesting to see how the non-commutative/matrix concentration inequalities provided by Tropp (2012) might be of use for other kinds of learning problems, such as multi-label classification, label ranking or structured prediction. Finally, the question of extending the present work to the analysis of algorithms learning possibly infinite-dimensional operators, as proposed by Abernethy et al. (2009), is also very exciting.

7. Conclusion

In this paper, we propose a new PAC-Bayesian generalization bound that applies in the multi-class classification setting. The originality of our contribution is that we consider the confusion matrix as an error measure. Coupled with the use of the operator norm on matrices, we are able to provide a generalization bound on the size of the confusion matrix, with the idea that the smaller the norm of the confusion matrix of the learned classifier, the better it is for the classification task at hand. The derivation of our result takes advantage of the concentration inequality proposed by Tropp (2012) for sums of random self-adjoint matrices, which we directly adapt to square matrices that are not self-adjoint. The main results are presented in Theorem 2 and Corollary 1. The bound in Theorem 2 is given on the difference between the true confusion risk of the Gibbs classifier and its empirical counterpart, while the one given in Corollary 1 upper-bounds the risk of the Gibbs classifier by its empirical error. An interesting point is that our bound depends on the minimal number of training examples belonging to the same class, for a given number of classes: if this value increases, i.e., if we have a lot of training examples, then the empirical confusion matrix of the Gibbs classifier tends to be close to its true confusion matrix. A point worth noting is that the bound varies as O(1/sqrt(m̃)), which is a typical rate in bounds not using second-order information. Last but not least, the present work has given rise to a few algorithmic and theoretical questions discussed in the previous section.

Appendix

Theorem 3 (Concentration inequality for random matrices, Tropp (2012)). Consider a finite sequence {M_i} of independent, random, self-adjoint matrices of dimension Q, and let {A_i} be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies E M_i = 0 and M_i ⪯ A_i almost surely. Then, for all ε >= 0,

  P( λ_max( Σ_i M_i ) >= ε ) <= Q exp( -ε² / (8σ²) ),

where σ² = || Σ_i A_i² || and ⪯ refers to the semidefinite order on self-adjoint matrices.

Theorem 4 (Markov's inequality). Let Z be a non-negative random variable and z > 0; then:

  P( Z >= z ) <= E[Z] / z.

Theorem 5 (Jensen's inequality). Let Z be an integrable real-valued random variable and g a convex function; then:

  g( E[Z] ) <= E[ g(Z) ].

References

Abernethy, J., Bach, F., Evgeniou, T., and Vert, J.-P. A new approach to collaborative filtering: operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

Catoni, O. Gibbs estimators. In Statistical Learning Theory and Stochastic Optimization, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.

Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. ArXiv e-prints, 2007.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

Donsker, D. and Varadhan, S. R. S. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28, 1975.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pp. 148–156, 1996.

Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems (NIPS), 2007.

Langford, J. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.

Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. In Advances in Neural Information Processing Systems 15, pp. 439–446. MIT Press, 2002.

Langford, J., Seeger, M., and Megiddo, N. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the International Conference on Machine Learning, pp. 290–297, 2001.

Laviolette, F., Marchand, M., and Roy, J.-F. From PAC-Bayes bounds to quadratic programs for majority votes. In Proceedings of the International Conference on Machine Learning, June 2011.

Lee, Y., Lin, Y., and Wahba, G. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.

McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999a.

McAllester, D. A. PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), pp. 164–170, 1999b.

McAllester, D. A. Simplified PAC-Bayesian margin bounds. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), pp. 203–215, 2003.

Morvant, E., Koço, S., and Ralaivola, L. PAC-Bayesian generalization bound on confusion matrix for multi-class classification. arXiv:1202.6228, 2012.

Mukherjee, I. and Schapire, R. E. A theory of multiclass boosting. CoRR, abs/1108.2989, 2011.

Paulsen, V. I. Completely Bounded Maps and Operator Algebras. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2002.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, pp. 80–91, 1999.

Seeger, M. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Weston, J. and Watkins, C. Multi-class support vector machines. Technical report, 1998.

Zhu, J., Zou, H., Rosset, S., and Hastie, T. Multi-class AdaBoost, 2009.