Structured Prediction Theory Based on Factor Graph Complexity


Corinna Cortes, Google Research, New York, NY
Vitaly Kuznetsov, Google Research, New York, NY
Mehryar Mohri, Courant Institute and Google, New York, NY
Scott Yang, Courant Institute, New York, NY

Abstract

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs. Our proof techniques include generalizations of Talagrand's contraction lemma that can be of independent interest. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. We also report the results of experiments with VCRF on several datasets to validate our theory.

1 Introduction

Structured prediction covers a broad family of important learning problems. These include key tasks in natural language processing such as part-of-speech tagging, parsing, machine translation, and named-entity recognition, important areas in computer vision such as image segmentation and object recognition, and also crucial areas in speech processing such as pronunciation modeling and speech recognition. In all these problems, the output space admits some structure. This may be a sequence of tags as in part-of-speech tagging, a parse tree as in context-free parsing, an acyclic graph as in dependency parsing, or labels of image segments as in object detection. Another property common to these tasks is that, in each case, the natural loss function admits a decomposition along the output substructures. As an example, the loss function may be the Hamming loss as in part-of-speech tagging, or it may be the edit-distance, which is widely used in natural language and speech processing.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The output structure and corresponding loss function make these problems significantly different from the (unstructured) binary classification problems extensively studied in learning theory. In recent years, a number of different algorithms have been designed for structured prediction, including Conditional Random Field (CRF) [Lafferty et al., 2001], StructSVM [Tsochantaridis et al., 2005], Maximum-Margin Markov Network (M3N) [Taskar et al., 2003], a kernel-regression algorithm [Cortes et al., 2007], and search-based approaches such as [Daumé III et al., 2009, Doppa et al., 2014, Lam et al., 2015, Chang et al., 2015, Ross et al., 2011]. More recently, deep learning techniques have also been developed for tasks including part-of-speech tagging [Jurafsky and Martin, 2009, Vinyals et al., 2015a], named-entity recognition [Nadeau and Sekine, 2007], machine translation [Zhang et al., 2008], image segmentation [Lucchi et al., 2013], and image annotation [Vinyals et al., 2015b].

However, in contrast to the plethora of algorithms, there have been relatively few studies devoted to the theoretical understanding of structured prediction [Bakir et al., 2007]. Existing learning guarantees hold primarily for simple losses such as the Hamming loss [Taskar et al., 2003, Cortes et al., 2014, Collins, 2001] and do not cover other natural losses such as the edit-distance. They also typically only apply to specific factor graph models. The main exception is the work of McAllester [2007], which provides PAC-Bayesian guarantees for arbitrary losses, though only in the special case of randomized algorithms using linear (count-based) hypotheses.

This paper presents a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a broad family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. For special cases studied in the past, our learning bounds match or improve upon the previously best bounds (see Section 3.3). In particular, our bounds improve upon those of Taskar et al. [2003]. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs.

We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. As a proof of concept validating our theory, we also report the results of experiments with VCRF on several datasets.

The paper is organized as follows. In Section 2 we introduce the notation and definitions relevant to our discussion of structured prediction. In Section 3, we derive a series of new learning guarantees for structured prediction, which are then used to prove the VRM principle in Section 4. Section 5 develops the algorithmic framework which is directly based on our theory. In Section 6, we provide some preliminary experimental results that serve as a proof of concept for our theory.

2 Preliminaries

Let X denote the input space and Y the output space. In structured prediction, the output space may
be a set of sequences, images, graphs, parse trees, lists, or some other (typically discrete) objects admitting some possibly overlapping structure. Thus, we assume that the output structure can be decomposed into l substructures. For example, this may be positions along a sequence, so that the output space Y is decomposable along these substructures: $\mathcal Y = \mathcal Y_1 \times \cdots \times \mathcal Y_l$. Here, $\mathcal Y_k$ is the set of possible labels (or classes) that can be assigned to substructure k.

Loss functions. We denote by $L\colon \mathcal Y \times \mathcal Y \to \mathbb R_+$ a loss function measuring the dissimilarity of two elements of the output space Y. We will assume that the loss function L is definite, that is $L(y, y') = 0$ iff $y = y'$. This assumption holds for all loss functions commonly used in structured prediction. A key aspect of structured prediction is that the loss function can be decomposed along the substructures $\mathcal Y_k$. As an example, L may be the Hamming loss defined by $L(y, y') = \frac{1}{l}\sum_{k=1}^{l} 1_{y_k \neq y'_k}$ for all $y = (y_1, \ldots, y_l)$ and $y' = (y'_1, \ldots, y'_l)$, with $y_k, y'_k \in \mathcal Y_k$. In the common case where Y is a set of sequences defined over a finite alphabet, L may be the edit-distance, which is widely used in natural language and speech processing applications, with possibly different costs associated to insertions, deletions and substitutions. L may also be a loss based on the negative inner product of the vectors of n-gram counts of two sequences, or its negative logarithm. Such losses have been

used to approximate the BLEU score loss in machine translation. There are other losses defined in computational biology based on various string-similarity measures. Our theoretical analysis is general and applies to arbitrary bounded and definite loss functions.

[Figure 1: Example of factor graphs. (a) Pairwise Markov network decomposition: $h(x,y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$; (b) Other decomposition: $h(x,y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.]

Scoring functions and factor graphs. We will adopt the common approach in structured prediction where predictions are based on a scoring function mapping $\mathcal X \times \mathcal Y$ to $\mathbb R$. Let H be a family of scoring functions. For any $h \in H$, we denote by $\mathsf h$ the predictor defined by h: for any $x \in \mathcal X$, $\mathsf h(x) = \operatorname{argmax}_{y \in \mathcal Y} h(x, y)$. Furthermore, we will assume, as is standard in structured prediction, that each function $h \in H$ can be decomposed as a sum. We will consider the most general case for such decompositions, which can be made explicit using the notion of factor graphs. A factor graph G is a tuple G = (V, F, E), where V is a set of variable nodes, F a set of factor nodes, and E a set of undirected edges between a variable node and a factor node. In our context, V can be identified with the set of substructure indices, that is V = {1, ..., l}. For any factor node f, denote by $N(f) \subseteq V$ the set of variable nodes connected to f via an edge and define $\mathcal Y_f$ as the substructure set cross-product $\mathcal Y_f = \prod_{k \in N(f)} \mathcal Y_k$. Then, h admits the following decomposition as a sum of functions $h_f$, each taking as argument an element of the input space $x \in \mathcal X$ and an element of $\mathcal Y_f$, $y_f \in \mathcal Y_f$:

$h(x, y) = \sum_{f \in F} h_f(x, y_f).$   (1)

Figure 1 illustrates this definition with two different decompositions. More generally, we will consider the setting in which a factor graph may depend on a particular example $(x_i, y_i)$: $G(x_i, y_i) = G_i = ([l_i], F_i, E_i)$. A special case of this setting is for example when the size $l_i$ (or length) of each example is allowed to vary and where the number of possible labels $|\mathcal Y|$ is potentially infinite. We present other examples of such hypothesis sets and their decomposition in Section 3, where we discuss our learning guarantees. Note that such hypothesis sets H with an additive decomposition are those commonly used in most structured prediction algorithms [Tsochantaridis et al., 2005, Taskar et al., 2003, Lafferty et al., 2001]. This is largely motivated by the computational requirement for efficient training and inference. Our results, while very general, further provide a statistical learning motivation for such decompositions.

Learning scenario. We consider the familiar supervised learning scenario where the training and test points are drawn i.i.d. according to some distribution D over $\mathcal X \times \mathcal Y$. We will further adopt the standard definitions of margin, generalization error and empirical error. The margin $\rho_h(x, y)$ of a hypothesis h for a labeled example $(x, y) \in \mathcal X \times \mathcal Y$ is defined by

$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y').$   (2)

Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training sample of size m drawn from $D^m$. We denote by R(h) the generalization error and by $\widehat R_S(h)$ the empirical error of h over S:

$R(h) = \mathbb E_{(x,y)\sim D}\big[L(\mathsf h(x), y)\big] \quad\text{and}\quad \widehat R_S(h) = \mathbb E_{(x,y)\sim S}\big[L(\mathsf h(x), y)\big],$   (3)

where $\mathsf h(x) = \operatorname{argmax}_y h(x, y)$ and where the notation $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by S.

Factor graphs are typically used to indicate the factorization of a probabilistic model. We are not assuming probabilistic models, but they would also be captured by our general framework: h would then be the negative log of a probability.
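To make the decomposition in Eq. (1) concrete, here is a minimal sketch (not from the paper; all function and variable names are illustrative) of a chain-structured scorer $h(x, y) = \sum_f h_f(x, y_f)$, the normalized Hamming loss, and brute-force argmax inference over a small output space.

```python
import itertools
from typing import Callable, Sequence, Tuple

# Factors of a chain model: factor f_k depends on positions (k, k+1),
# so Y_f is the cross-product of the two label sets (here both [0, num_labels)).
def chain_factors(l: int) -> Sequence[Tuple[int, int]]:
    return [(k, k + 1) for k in range(l - 1)]

def score(h_f: Callable, x, y: Sequence[int], factors) -> float:
    """h(x, y) = sum_f h_f(x, y_f), cf. Eq. (1)."""
    return sum(h_f(x, f, tuple(y[k] for k in f)) for f in factors)

def hamming_loss(y: Sequence[int], y_prime: Sequence[int]) -> float:
    """Normalized Hamming loss: (1/l) * sum_k 1[y_k != y'_k]."""
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def predict(h_f, x, l: int, num_labels: int, factors) -> Tuple[int, ...]:
    """Brute-force argmax over Y = Y_1 x ... x Y_l (exponential; for tiny l only)."""
    return max(itertools.product(range(num_labels), repeat=l),
               key=lambda y: score(h_f, x, y, factors))

if __name__ == "__main__":
    # Toy pairwise scorer: rewards matching x at each position and equal consecutive labels.
    def h_f(x, f, y_f):
        k, k1 = f
        return float(y_f[0] == x[k]) + float(y_f[1] == x[k1]) + 0.5 * float(y_f[0] == y_f[1])

    x = [0, 0, 1, 1]
    factors = chain_factors(len(x))
    y_hat = predict(h_f, x, l=len(x), num_labels=2, factors=factors)
    print(y_hat, hamming_loss(y_hat, [0, 0, 1, 1]))
```

In practice, the argmax in the predictor is computed with dynamic programming (e.g., Viterbi for chains) rather than enumeration; the brute-force version above is only meant to mirror the definitions.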

The learning problem consists of using the sample S to select a hypothesis $h \in H$ with small expected loss R(h). Observe that the definiteness of the loss function implies, for all $x \in \mathcal X$, the following equality:

$L(\mathsf h(x), y) = L(\mathsf h(x), y)\,1_{\rho_h(x,y) \le 0}.$   (4)

We will later use this identity in the derivation of surrogate loss functions.

3 General learning bounds for structured prediction

In this section, we present new learning guarantees for structured prediction. Our analysis is general and applies to the broad family of definite and bounded loss functions described in the previous section. It is also general in the sense that it applies to general hypothesis sets and not just sub-families of linear functions. For linear hypotheses, we will give a more refined analysis that holds for arbitrary norm-p regularized hypothesis sets.

The theoretical analysis of structured prediction is more complex than for classification since, by definition, it depends on the properties of the loss function and the factor graph. These attributes capture the combinatorial properties of the problem which must be exploited since the total number of labels is often exponential in the size of that graph. To tackle this problem, we first introduce a new complexity tool.

3.1 Complexity measure

A key ingredient of our analysis is a new data-dependent notion of complexity that extends the classical Rademacher complexity. We define the empirical factor graph Rademacher complexity $\widehat{\mathfrak R}^G_S(H)$ of a hypothesis set H for a sample $S = (x_1, \ldots, x_m)$ and factor graph G as follows:

$\widehat{\mathfrak R}^G_S(H) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{h \in H}\sum_{i=1}^{m}\sum_{f \in F_i}\sum_{y \in \mathcal Y_f}\sqrt{|F_i|}\;\epsilon_{i,f,y}\,h_f(x_i, y)\Big],$

where $\epsilon = (\epsilon_{i,f,y})_{i \in [m], f \in F_i, y \in \mathcal Y_f}$ and the $\epsilon_{i,f,y}$'s are independent Rademacher random variables uniformly distributed over $\{\pm1\}$. The factor graph Rademacher complexity of H for a factor graph G is defined as the expectation: $\mathfrak R^G_m(H) = \mathbb E_{S \sim D^m}\big[\widehat{\mathfrak R}^G_S(H)\big]$. It can be shown that the empirical factor graph Rademacher complexity is concentrated around its mean (Lemma 8). The factor graph Rademacher complexity is a natural extension of the standard Rademacher complexity to vector-valued hypothesis sets (with one coordinate per factor in our case). For binary classification, the factor graph and standard Rademacher complexities coincide. Otherwise, the factor graph complexity can be upper bounded in terms of the standard one. As with the standard Rademacher complexity, the factor graph Rademacher complexity of a hypothesis set can be estimated from data in many cases. In some important cases, it also admits explicit upper bounds similar to those for the standard Rademacher complexity but with an additional dependence on the factor graph quantities. We will prove this for several families of functions which are commonly used in structured prediction (Theorem 2).

3.2 Generalization bounds

In this section, we present new margin bounds for structured prediction based on the factor graph Rademacher complexity of H. Our results hold both for the additive and the multiplicative empirical margin losses defined below:

$\widehat R^{\mathrm{add}}_{S,\rho}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big]$   (5)

$\widehat R^{\mathrm{mult}}_{S,\rho}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big)\Big]$   (6)

Here, $\Phi^*(r) = \min(M, \max(0, r))$ for all r, with $M = \max_{y,y'} L(y, y')$. As we show in Section 5, convex upper bounds on $\widehat R^{\mathrm{add}}_{S,\rho}(h)$ and $\widehat R^{\mathrm{mult}}_{S,\rho}(h)$ directly lead to many existing structured prediction algorithms.
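As a small illustration (not part of the paper), the empirical margin losses (5) and (6) can be evaluated directly when the output space is small enough to enumerate; `scores[y]` plays the role of h(x, y), `loss[y]` of L(y, y_true), and Φ* and M are as defined above. All names are illustrative.

```python
from typing import Dict, Hashable

def phi_star(r: float, M: float) -> float:
    """Phi*(r) = min(M, max(0, r)), with M = max_{y,y'} L(y, y')."""
    return min(M, max(0.0, r))

def additive_margin_loss(scores: Dict[Hashable, float], loss: Dict[Hashable, float],
                         y_true: Hashable, rho: float, M: float) -> float:
    """One term of Eq. (5): Phi*( max_{y' != y} L(y', y) - (h(x,y) - h(x,y')) / rho )."""
    inner = max(loss[y] - (scores[y_true] - scores[y]) / rho
                for y in scores if y != y_true)
    return phi_star(inner, M)

def multiplicative_margin_loss(scores, loss, y_true, rho: float, M: float) -> float:
    """One term of Eq. (6): Phi*( max_{y' != y} L(y', y) * (1 - (h(x,y) - h(x,y')) / rho) )."""
    inner = max(loss[y] * (1.0 - (scores[y_true] - scores[y]) / rho)
                for y in scores if y != y_true)
    return phi_star(inner, M)

if __name__ == "__main__":
    scores = {"a": 2.0, "b": 1.0, "c": -0.5}   # h(x, y) for each candidate output
    loss = {"a": 0.0, "b": 0.4, "c": 1.0}      # L(y, y_true) with y_true = "a"
    M = max(loss.values())
    print(additive_margin_loss(scores, loss, "a", rho=1.0, M=M))
    print(multiplicative_margin_loss(scores, loss, "a", rho=1.0, M=M))
```

Averaging these per-example values over the sample gives the empirical quantities appearing in the bounds below.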

The following is our general data-dependent margin bound for structured prediction.

Theorem 1. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

The full proof of Theorem 1 is given in Appendix A. It is based on a new contraction lemma (Lemma 5) generalizing Talagrand's lemma that can be of independent interest. We also present a more refined contraction lemma (Lemma 6) that can be used to improve the bounds of Theorem 1. Theorem 1 is the first data-dependent generalization guarantee for structured prediction with general loss functions, general hypothesis sets, and arbitrary factor graphs for both multiplicative and additive margins. We also present a version of this result with empirical complexities as Theorem 7 in the supplementary material. We will compare these guarantees to known special cases below.

The margin bounds above can be extended to hold uniformly over $\rho \in (0, 1]$, at the price of an additional term of the form $\sqrt{\log\log_2\frac2\rho/m}$ in the bound, using known techniques (see for example [Mohri et al., 2012]).

The hypothesis set used by convex structured prediction algorithms such as StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003] or Conditional Random Field (CRF) [Lafferty et al., 2001] is that of linear functions. More precisely, let Ψ be a feature mapping from $\mathcal X \times \mathcal Y$ to $\mathbb R^N$ such that $\Psi(x, y) = \sum_{f \in F}\Psi_f(x, y_f)$. For any p, define $H_p$ as follows:

$H_p = \{x \mapsto w\cdot\Psi(x, y)\colon w \in \mathbb R^N,\ \|w\|_p \le \Lambda_p\}.$

Then, $\widehat{\mathfrak R}^G_S(H_p)$ can be efficiently estimated using random sampling and solving LP programs. Moreover, one can obtain explicit upper bounds on $\mathfrak R^G_m(H_p)$. To simplify our presentation, we will consider the case p = 1, 2, but our results can be extended to arbitrary $p \ge 1$ and, more generally, to arbitrary group norms.

Theorem 2. For any sample $S = (x_1, \ldots, x_m)$, the following upper bounds hold for the empirical factor graph complexity of $H_1$ and $H_2$:

$\widehat{\mathfrak R}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m}\sqrt{s\,\log(2N)},\qquad \widehat{\mathfrak R}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

where $r_\infty = \max_{i,f,y}\|\Psi_f(x_i, y)\|_\infty$, $r_2 = \max_{i,f,y}\|\Psi_f(x_i, y)\|_2$ and where s is a sparsity factor defined by $s = \max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}$.

Plugging these factor graph complexity upper bounds into Theorem 1 immediately yields explicit data-dependent structured prediction learning guarantees for linear hypotheses with general loss functions and arbitrary factor graphs (see Corollary 10). Observe that, in the worst case, the sparsity factor can be bounded as follows:

$s \le \sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i| \le \sum_{i=1}^m|F_i|^2 d_i \le m\max_i|F_i|^2 d_i,$

where $d_i = \max_{f\in F_i}|\mathcal Y_f|$. Thus, the factor graph Rademacher complexities of linear hypotheses in $H_1$ scale as $O\big(\sqrt{\log(N)\max_i|F_i|^2 d_i/m}\big)$. An important observation is that $|F_i|$ and $d_i$ depend on the observed sample. This shows that the expected size of the factor graph is crucial for learning in this scenario. This should be contrasted with other existing structured prediction guarantees that we discuss below, which assume a fixed upper bound on the size of the factor graph. Note that our result shows that learning is possible even with an infinite set Y. To the best of our knowledge, this is the first learning guarantee for learning with infinitely many classes.

A result similar to Lemma 5 has also been recently proven independently in [Maurer, 2016].
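For linear hypotheses the inner supremum can be evaluated in closed form by duality (for $H_2$, $\sup_{\|w\|_2\le\Lambda_2} w\cdot v = \Lambda_2\|v\|_2$), so the empirical factor graph Rademacher complexity can be estimated by sampling Rademacher variables, and the Theorem 2 quantities $r_\infty$, $r_2$ and s can be computed in one pass over the features. The following sketch assumes feature maps are available as dense NumPy vectors $\Psi_f(x_i, y)$ indexed by (i, f, y); it is an illustration, not the authors' code.

```python
import numpy as np

# features[i] is a list of factors; each factor is a list of feature vectors,
# one vector Psi_f(x_i, y) of dimension N per assignment y in Y_f.
def fg_rademacher_H2(features, Lambda2: float, num_samples: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of (Lambda2/m) * E_eps || sum_{i,f,y} sqrt(|F_i|) eps_{i,f,y} Psi_f(x_i,y) ||_2."""
    rng = np.random.default_rng(seed)
    m = len(features)
    N = features[0][0][0].shape[0]
    vals = []
    for _ in range(num_samples):
        v = np.zeros(N)
        for factors in features:
            w_i = np.sqrt(len(factors))                       # sqrt(|F_i|)
            for factor in factors:
                for psi in factor:                            # one psi per y in Y_f
                    v += w_i * rng.choice([-1.0, 1.0]) * psi  # eps_{i,f,y} * Psi_f(x_i, y)
        vals.append(np.linalg.norm(v))
    return Lambda2 * float(np.mean(vals)) / m

def theorem2_quantities(features):
    """r_inf, r_2 and the sparsity factor s appearing in Theorem 2."""
    r_inf = max(np.max(np.abs(psi)) for fs in features for factor in fs for psi in factor)
    r_2 = max(np.linalg.norm(psi) for fs in features for factor in fs for psi in factor)
    N = features[0][0][0].shape[0]
    s_j = np.zeros(N)
    for factors in features:
        for factor in factors:
            for psi in factor:
                s_j += len(factors) * (psi != 0)              # |F_i| * 1[Psi_{f,j}(x_i,y) != 0]
    return float(r_inf), float(r_2), float(np.max(s_j))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 5 examples, 3 factors each, 4 assignments per factor, N = 8 features.
    feats = [[[rng.standard_normal(8) for _ in range(4)] for _ in range(3)] for _ in range(5)]
    print(fg_rademacher_H2(feats, Lambda2=1.0))
    print(theorem2_quantities(feats))
```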

Our learning guarantee for $H_1$ can additionally benefit from the sparsity of the feature mapping and observed data. In particular, in many applications, $\Psi_{f,j}$ is a binary indicator function that is non-zero for a single $(x, y) \in \mathcal X \times \mathcal Y_f$. For instance, in NLP, $\Psi_{f,j}$ may indicate an occurrence of a certain n-gram in the input $x_i$ and output $y_i$. In this case, $s = \sum_i|F_i|^2 \le m\max_i|F_i|^2$ and the complexity term is only in $O\big(\max_i|F_i|\sqrt{\log(N)/m}\big)$, where N may depend linearly on $d_i$.

3.3 Special cases and comparisons

Markov networks. For the pairwise Markov networks with a fixed number of substructures l studied by Taskar et al. [2003], our equivalent factor graph admits l nodes, $|F_i| = l$, and the maximum size of $\mathcal Y_f$ is $d_i = k^2$ if each substructure of a pair can be assigned one of k classes. Thus, if we apply Corollary 10 with the Hamming distance as our loss function and divide the bound through by l, to normalize the loss to the interval [0, 1] as in [Taskar et al., 2003], we obtain the following explicit form of our guarantee for an additive empirical margin loss, for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2 k}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

This bound can be further improved by eliminating the dependency on k using an extension of our contraction Lemma 5 to $\|\cdot\|_{\infty,2}$ (see Lemma 6). The complexity term of Taskar et al. [2003] is bounded by a quantity that varies as $\tilde O\big(\Lambda_2 q r_2/\sqrt m\big)$, where q is the maximal out-degree of a factor graph. Our bound has the same dependence on these key quantities, but with no logarithmic term in our case. Note that, unlike the result of Taskar et al. [2003], our bound also holds for general loss functions and different p-norm regularizers. Moreover, our result for a multiplicative empirical margin loss is new, even in this special case.

Multi-class classification. For standard (unstructured) multi-class classification, we have $|F_i| = 1$ and $d_i = c$, where c is the number of classes. In that case, for linear hypotheses with norm-2 regularization, the complexity term of our bound varies as $O\big(\Lambda_2 r_2\sqrt{c/m}\big)$ (Corollary 11). This improves upon the best known general margin bounds of Kuznetsov et al. [2014], who provide a guarantee that scales linearly with the number of classes instead. Moreover, in the special case where an individual $w_y$ is learned for each class $y \in [c]$, we retrieve the recent favorable bounds given by Lei et al. [2015], albeit with a somewhat simpler formulation. In that case, for any (x, y), all components of the feature vector Ψ(x, y) are zero, except (perhaps) for the N components corresponding to class y, where N is the dimension of $w_y$. In view of that, for example for a group-norm-2,1 regularization, the complexity term of our bound varies as $O\big(\Lambda r\sqrt{\log c}/\sqrt m\big)$, which matches the results of Lei et al. [2015] with a logarithmic dependency on c (ignoring some complex exponents of log c in their case). Additionally, note that unlike existing multi-class learning guarantees, our results hold for arbitrary loss functions. See Corollary 12 for further details. Our sparsity-based bounds can also be used to give bounds with logarithmic dependence on the number of classes when the features only take values in {0, 1}. Finally, using Lemma 6 instead of Lemma 5, the dependency on the number of classes can be further improved.

We conclude this section by observing that, since our guarantees are expressed in terms of the average size of the factor graph over a given sample, this invites us to search for a hypothesis set H and predictor $h \in H$ such that the trade-off between the empirical size of the factor graph and empirical error is optimal. In the next section, we will make use of the recently developed principle of Voted Risk Minimization (VRM) [Cortes et al., 2015] to reach this objective.
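As a quick numeric illustration of the scaling just discussed (not from the paper), the worst-case $H_2$ complexity term for a pairwise chain with l substructures and k classes, where $|F_i| = l$ and $d_i = k^2$, can be tabulated as follows; after normalizing the Hamming loss by l it decays as $k/\sqrt m$.

```python
import math

def h2_complexity_term(m: int, l: int, k: int, Lambda2: float = 1.0, r2: float = 1.0) -> float:
    """Worst-case H_2 factor-graph complexity for a pairwise chain:
    |F_i| = l, d_i = k**2, so (Lambda2*r2/m)*sqrt(sum_{i,f,y}|F_i|) <= Lambda2*r2*l*k/sqrt(m)."""
    return Lambda2 * r2 * l * k / math.sqrt(m)

if __name__ == "__main__":
    for m in (100, 1_000, 10_000):
        # Dividing by l normalizes the Hamming loss to [0, 1] as in the Markov-network comparison.
        print(m, h2_complexity_term(m, l=20, k=5) / 20)
```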
4 Voted Risk Minimization

In many structured prediction applications such as natural language processing and computer vision, one may wish to exploit very rich features. However, the use of rich families of hypotheses could lead to overfitting. In this section, we show that it may be possible to use rich families in conjunction with simpler families, provided that fewer complex hypotheses are used (or that they are used with less mixture weight). We achieve this goal by deriving learning guarantees for ensembles of structured prediction rules that explicitly account for the differing complexities between families. This will motivate the algorithms that we present in Section 5.

Assume that we are given p families $H_1, \ldots, H_p$ of functions mapping from $\mathcal X \times \mathcal Y$ to $\mathbb R$. Define the ensemble family $F = \mathrm{conv}(\cup_{k=1}^p H_k)$, that is the family of functions f of the form $f = \sum_{t=1}^T\alpha_t h_t$, where $\alpha = (\alpha_1, \ldots, \alpha_T)$ is in the simplex and where, for each $t \in [1, T]$, $h_t$ is in $H_{k_t}$ for some $k_t \in [1, p]$. We further assume that $\mathfrak R^G_m(H_1) \le \mathfrak R^G_m(H_2) \le \cdots \le \mathfrak R^G_m(H_p)$. As an example, the $H_k$'s may be ordered by the size of the corresponding factor graphs.

The main result of this section is a generalization of the VRM theory to the structured prediction setting. The learning guarantees that we present are in terms of upper bounds on $\widehat R^{\mathrm{add}}_{S,\rho,\tau}(h)$ and $\widehat R^{\mathrm{mult}}_{S,\rho,\tau}(h)$, which are defined as follows for all $\tau \ge 0$:

$\widehat R^{\mathrm{add}}_{S,\rho,\tau}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) + \frac{\tau - h(x,y) + h(x,y')}{\rho}\Big)\Big]$   (7)

$\widehat R^{\mathrm{mult}}_{S,\rho,\tau}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 + \frac{\tau - h(x,y) + h(x,y')}{\rho}\Big)\Big)\Big]$   (8)

Here, τ can be interpreted as a margin term that acts in conjunction with ρ. For simplicity, we assume in this section that $|\mathcal Y| = c < +\infty$.

Theorem 3. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, each of the following inequalities holds for all $f \in F$:

$R(f) \le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

$R(f) \le \widehat R^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt2\,M}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

where $C(\rho, M, c, m, p) = \frac{2M}{\rho}\sqrt{\frac{\log p}{m}} + 3M\sqrt{\Big\lceil\frac{4}{\rho^2}\log\Big(\frac{c^2\rho^2 m}{4\log p}\Big)\Big\rceil\frac{\log p}{m} + \frac{\log\frac2\delta}{2m}}$.

The proof of this theorem crucially depends on the theory we developed in Section 3 and is given in Appendix A. As with Theorem 1, we also present a version of this result with empirical complexities as Theorem 14 in the supplementary material. The explicit dependence of this bound on the parameter vector α suggests that learning even with highly complex hypothesis sets could be possible so long as the complexity term, which is a weighted average of the factor graph complexities, is not too large. The theorem provides a quantitative way of determining the mixture weights that should be apportioned to each family. Furthermore, the dependency on the number of distinct feature map families $H_k$ is very mild and therefore suggests that a large number of families can be used. These properties will be useful for motivating new algorithms for structured prediction.

5 Algorithms

In this section, we derive several algorithms for structured prediction based on the VRM principle discussed in Section 4. We first give general convex upper bounds (Section 5.1) on the structured prediction loss which recover as special cases the loss functions used in StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003], and Conditional Random Field (CRF) [Lafferty et al., 2001]. Next, we introduce a new algorithm, Voted Conditional Random Field (VCRF) (Section 5.2), with accompanying experiments as proof of concept. We also present another algorithm, Voted StructBoost (VStructBoost), in Appendix C.

5.1 General framework for convex surrogate losses

Given $(x, y) \in \mathcal X \times \mathcal Y$, the mapping $h \mapsto L(\mathsf h(x), y)$ is typically not a convex function of h, which leads to computationally hard optimization problems. This motivates the use of convex surrogate losses. We first introduce a general formulation of surrogate losses for structured prediction problems.

Lemma 4. For any $u \in \mathbb R_+$, let $\Phi_u\colon \mathbb R \to \mathbb R$ be an upper bound on $v \mapsto u\,1_{v \le 0}$. Then, the following upper bound holds for any $h \in H$ and $(x, y) \in \mathcal X \times \mathcal Y$:

$L(\mathsf h(x), y) \le \max_{y'\neq y}\Phi_{L(y',y)}\big(h(x,y) - h(x,y')\big).$   (9)

The proof is given in Appendix A.
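To preview how Lemma 4 is used (the specific choices of $\Phi_u$ are discussed in Section 5.1 right below), here is a minimal sketch, not from the paper, of the generic surrogate $\max_{y'\neq y}\Phi_{L(y',y)}(h(x,y) - h(x,y'))$ together with a few choices of $\Phi_u$; each $\Phi_u$ below upper bounds $v \mapsto u\,1_{v\le0}$ as the lemma requires. All names are illustrative.

```python
import math
from typing import Callable, Dict, Hashable

# Each Phi_u(v) upper bounds v -> u * 1[v <= 0], as required by Lemma 4.
def phi_structsvm(u: float, v: float) -> float:   # u * max(0, 1 - v): slack-rescaling hinge
    return u * max(0.0, 1.0 - v)

def phi_m3n(u: float, v: float) -> float:         # max(0, u - v): margin-rescaling hinge
    return max(0.0, u - v)

def phi_crf(u: float, v: float) -> float:         # log(1 + e^{u - v}): CRF-style log loss
    return math.log1p(math.exp(u - v))

def phi_structboost(u: float, v: float) -> float: # u * e^{-v}: exponential (boosting-style)
    return u * math.exp(-v)

def surrogate(scores: Dict[Hashable, float], loss: Dict[Hashable, float],
              y_true: Hashable, phi: Callable[[float, float], float]) -> float:
    """Right-hand side of Eq. (9): max_{y' != y} Phi_{L(y',y)}(h(x,y) - h(x,y'))."""
    return max(phi(loss[y], scores[y_true] - scores[y]) for y in scores if y != y_true)

if __name__ == "__main__":
    scores = {"a": 1.5, "b": 1.2, "c": -1.0}
    loss = {"a": 0.0, "b": 0.5, "c": 1.0}
    for phi in (phi_structsvm, phi_m3n, phi_crf, phi_structboost):
        print(phi.__name__, surrogate(scores, loss, "a", phi))
```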

This result defines a general framework that enables us to straightforwardly recover many of the most common state-of-the-art structured prediction algorithms via suitable choices of $\Phi_u(v)$: (a) for $\Phi_u(v) = \max(0, u(1 - v))$, the right-hand side of (9) coincides with the surrogate loss defining StructSVM [Tsochantaridis et al., 2005]; (b) for $\Phi_u(v) = \max(0, u - v)$, it coincides with the surrogate loss defining Max-Margin Markov Networks (M3N) [Taskar et al., 2003] when using for L the Hamming loss; and (c) for $\Phi_u(v) = \log(1 + e^{u-v})$, it coincides with the surrogate loss defining the Conditional Random Field (CRF) [Lafferty et al., 2001].

Moreover, alternative choices of $\Phi_u(v)$ can help define new algorithms. In particular, we will refer to the algorithm based on the surrogate loss defined by $\Phi_u(v) = u e^{-v}$ as StructBoost, in reference to the exponential loss used in AdaBoost. Another related alternative is based on the choice $\Phi_u(v) = e^{u-v}$. See Appendix C for further details on this algorithm. In fact, for each $\Phi_u(v)$ described above, the corresponding convex surrogate is an upper bound on either the multiplicative or additive margin loss introduced in Section 3. Therefore, each of these algorithms seeks a hypothesis that minimizes the generalization bounds presented in Section 3. To the best of our knowledge, this interpretation of these well-known structured prediction algorithms is also new. In what follows, we derive new structured prediction algorithms that minimize the finer generalization bounds presented in Section 4.

5.2 Voted Conditional Random Field (VCRF)

We first consider the convex surrogate loss based on $\Phi_u(v) = \log(1 + e^{u-v})$, which corresponds to the loss defining CRF models. Using the monotonicity of the logarithm and upper bounding the maximum by a sum gives the following upper bound on the surrogate loss:

$\max_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot(\Psi(x,y)-\Psi(x,y'))}\big) \le \log\Big(\sum_{y'\in\mathcal Y} e^{L(y',y) - w\cdot(\Psi(x,y)-\Psi(x,y'))}\Big),$

which, combined with the VRM principle, leads to the following optimization problem:

$\min_{w}\ \frac1m\sum_{i=1}^m\log\Big(\sum_{y\in\mathcal Y} e^{L(y,y_i) - w\cdot(\Psi(x_i,y_i)-\Psi(x_i,y))}\Big) + \sum_{k=1}^p(\lambda r_k + \beta)\|w_k\|_1,$   (10)

where $r_k = r_\infty|F_k|\sqrt{\log N}$. We refer to the learning algorithm based on the optimization problem (10) as VCRF. Note that for λ = 0, (10) coincides with the objective function of L1-regularized CRF. Observe that we can also directly use $\max_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot\delta\Psi(x,y,y')}\big)$ or its upper bound $\sum_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot\delta\Psi(x,y,y')}\big)$ as a convex surrogate. We can similarly derive an L2-regularization formulation of the VCRF algorithm. In Appendix D, we describe efficient algorithms for solving the VCRF and VStructBoost optimization problems.

6 Experiments

In Appendix B, we corroborate our theory by reporting experimental results suggesting that the VCRF algorithm can outperform the CRF algorithm on a number of part-of-speech (POS) datasets.

7 Conclusion

We presented a general theoretical analysis of structured prediction. Our data-dependent margin guarantees for structured prediction can be used to guide the design of new algorithms or to derive guarantees for existing ones. Their explicit dependency on the properties of the factor graph and on feature sparsity can help shed new light on the role played by the graph and features in generalization. Our extension of the VRM theory to structured prediction provides a new analysis of generalization when using a very rich set of features, which is common in applications such as natural language processing, and leads to new algorithms, VCRF and VStructBoost. Our experimental results for VCRF serve as a proof of concept and motivate more extensive empirical studies of these algorithms.
Acknowledgments

This work was partly funded by NSF CCF and IIS-6866 and NSF GRFP DGE.

References

G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.

K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In ICML, 2015.

M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Proceedings of IWPT, 2001.

C. Cortes, M. Mohri, and J. Weston. A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press, 2007.

C. Cortes, V. Kuznetsov, and M. Mohri. Ensemble methods for structured prediction. In ICML, 2014.

C. Cortes, P. Goyal, V. Kuznetsov, and M. Mohri. Kernel extraction via voted risk minimization. JMLR, 2015.

H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297-325, 2009.

J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1):1317-1350, 2014.

D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In NIPS, 2014.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. HC-search for structured prediction in computer vision. In CVPR, 2015.

Y. Lei, Ü. D. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In NIPS, 2015.

A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, 2013.

A. Maurer. A vector-contraction inequality for Rademacher complexities. In ALT, 2016.

D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26, January 2007.

S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, Dec. 2005.

O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015a.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015b.

D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In IJCNLP, 2008.

A Proofs

This appendix section gathers detailed proofs of all of our main results. In Appendix A.1, we prove a contraction lemma used as a tool in the proof of our general factor graph Rademacher complexity bounds (Appendix A.3). In Appendix A.8, we further extend our bounds to the Voted Risk Minimization setting. Appendix A.5 gives explicit upper bounds on the factor graph Rademacher complexity of several commonly used hypothesis sets. In Appendix A.9, we prove a general upper bound on a loss function used in structured prediction in terms of a convex surrogate.

A.1 Contraction lemma

The following contraction lemma will be a key tool used in the proofs of our generalization bounds for structured prediction.

Lemma 5. Let H be a hypothesis set of functions mapping $\mathcal X$ to $\mathbb R^c$. Assume that for all $i = 1, \ldots, m$, $\Psi_i\colon \mathbb R^c \to \mathbb R$ is $\mu_i$-Lipschitz for $\mathbb R^c$ equipped with the 2-norm. That is:

$|\Psi_i(x') - \Psi_i(x)| \le \mu_i\|x' - x\|_2, \quad\text{for all } (x', x) \in (\mathbb R^c)^2.$   (11)

Then, for any sample S of m points $x_1, \ldots, x_m \in \mathcal X$, the following inequality holds:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] \le \sqrt2\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{j=1}^c\epsilon_{ij}\,\mu_i h_j(x_i)\Big],$   (12)

where $\epsilon = (\epsilon_{ij})_{i,j}$ and the $\epsilon_{ij}$'s are independent Rademacher variables uniformly distributed over $\{\pm1\}$.

Proof. Fix a sample $S = (x_1, \ldots, x_m)$. Then, we can rewrite the left-hand side of (12) as follows:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] = \mathbb E_{\sigma_1,\ldots,\sigma_{m-1}}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big]\Big],$

where $U_{m-1}(h) = \sum_{i=1}^{m-1}\sigma_i\Psi_i(h(x_i))$. Assume that the suprema can be attained and let $h_1, h_2 \in H$ be the hypotheses satisfying

$U_{m-1}(h_1) + \Psi_m(h_1(x_m)) = \sup_{h\in H} U_{m-1}(h) + \Psi_m(h(x_m)),$
$U_{m-1}(h_2) - \Psi_m(h_2(x_m)) = \sup_{h\in H} U_{m-1}(h) - \Psi_m(h(x_m)).$

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are ε-close to the suprema for any ε > 0. By definition of expectation, since $\sigma_m$ is uniformly distributed over $\{\pm1\}$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] = \tfrac12\sup_{h\in H}\big[U_{m-1}(h) + \Psi_m(h(x_m))\big] + \tfrac12\sup_{h\in H}\big[U_{m-1}(h) - \Psi_m(h(x_m))\big]$
$= \tfrac12\big[U_{m-1}(h_1) + \Psi_m(h_1(x_m))\big] + \tfrac12\big[U_{m-1}(h_2) - \Psi_m(h_2(x_m))\big].$

Next, using the $\mu_m$-Lipschitzness of $\Psi_m$ and the Khintchine-Kahane inequality, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \tfrac12\big[U_{m-1}(h_1) + U_{m-1}(h_2) + \mu_m\|h_1(x_m) - h_2(x_m)\|_2\big]$
$\le \tfrac12\Big[U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m\,\mathbb E_{\epsilon_{m1},\ldots,\epsilon_{mc}}\Big|\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m) - h_{2,j}(x_m)\big)\Big|\Big].$

Now, let $\epsilon_m$ denote $(\epsilon_{m1}, \ldots, \epsilon_{mc})$ and let $s(\epsilon_m) \in \{\pm1\}$ denote the sign of $\sum_{j=1}^c\epsilon_{mj}(h_{1,j}(x_m) - h_{2,j}(x_m))$. Then, the following holds:

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m) - h_{2,j}(x_m)\big)s(\epsilon_m)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m)\Big)\Big]$
$\le \mathbb E_{\epsilon_m}\Big[\tfrac12\sup_{h\in H}\Big(U_{m-1}(h) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big) + \tfrac12\sup_{h\in H}\Big(U_{m-1}(h) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sigma_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big]\Big]$
$= \mathbb E_{\epsilon_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big].$

Proceeding in the same way for all other $\sigma_i$'s ($i < m$) completes the proof.

A.2 Contraction lemma for the $\|\cdot\|_{\infty,2}$-norm

In this section, we present an extension of the contraction Lemma 5 that can be used to remove the dependency on the alphabet size in all of our bounds.

Lemma 6. Let H be a hypothesis set of functions mapping $\mathcal X \times [d]$ to $\mathbb R^c$. Assume that for all $i = 1, \ldots, m$, $\Psi_i$ is $\mu_i$-Lipschitz for $\mathbb R^{c\times d}$ equipped with the norm $\|\cdot\|_{\infty,2}$ for some $\mu_i > 0$. That is,

$|\Psi_i(x') - \Psi_i(x)| \le \mu_i\|x' - x\|_{\infty,2}, \quad\text{for all } (x', x) \in (\mathbb R^{c\times d})^2.$   (13)

Then, for any sample S of m points $x_1, \ldots, x_m \in \mathcal X$, there exists a distribution U over $[d]^c$ such that the following inequality holds:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] \le \sqrt2\,\mathbb E_{\upsilon\sim U,\,\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{j=1}^c\epsilon_{ij}\,\mu_i h_j(x_i, \upsilon_j)\Big],$   (14)

where $\epsilon = (\epsilon_{ij})_{i,j}$ and the $\epsilon_{ij}$'s are independent Rademacher variables uniformly distributed over $\{\pm1\}$ and $\upsilon = (\upsilon_{i,j})_{i,j}$ is a sequence of random variables distributed according to U. Note that the $\upsilon_{i,j}$'s themselves do not need to be independent.

Proof. Fix a sample $S = (x_1, \ldots, x_m)$. Then, we can rewrite the left-hand side of (14) as follows:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] = \mathbb E_{\sigma_1,\ldots,\sigma_{m-1}}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big]\Big],$

where $U_{m-1}(h) = \sum_{i=1}^{m-1}\sigma_i\Psi_i(h(x_i))$. Assume that the suprema can be attained and let $h_1, h_2 \in H$ be the hypotheses satisfying

$U_{m-1}(h_1) + \Psi_m(h_1(x_m)) = \sup_{h\in H} U_{m-1}(h) + \Psi_m(h(x_m)),$
$U_{m-1}(h_2) - \Psi_m(h_2(x_m)) = \sup_{h\in H} U_{m-1}(h) - \Psi_m(h(x_m)).$

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are ε-close to the suprema for any ε > 0. By definition of expectation, since $\sigma_m$ is uniformly distributed over $\{\pm1\}$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] = \tfrac12\sup_{h\in H}\big[U_{m-1}(h) + \Psi_m(h(x_m))\big] + \tfrac12\sup_{h\in H}\big[U_{m-1}(h) - \Psi_m(h(x_m))\big]$
$= \tfrac12\big[U_{m-1}(h_1) + \Psi_m(h_1(x_m))\big] + \tfrac12\big[U_{m-1}(h_2) - \Psi_m(h_2(x_m))\big].$

Next, using the $\mu_m$-Lipschitzness of $\Psi_m$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \tfrac12\big[U_{m-1}(h_1) + U_{m-1}(h_2) + \mu_m\|h_1(x_m) - h_2(x_m)\|_{\infty,2}\big].$

Define the random variables $\upsilon_j = \upsilon_j(\sigma) = \operatorname{argmax}_{k\in[d]}|h_{1,j}(x_m, k) - h_{2,j}(x_m, k)|$, so that $\|h_1(x_m) - h_2(x_m)\|_{\infty,2} = \big(\sum_{j=1}^c(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j))^2\big)^{1/2}$. By the Khintchine-Kahane inequality,

$\mu_m\|h_1(x_m) - h_2(x_m)\|_{\infty,2} \le \sqrt2\,\mu_m\,\mathbb E_{\epsilon_{m1},\ldots,\epsilon_{mc}}\Big|\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j)\big)\Big|.$

Now, let $\epsilon_m$ denote $(\epsilon_{m1},\ldots,\epsilon_{mc})$ and let $s(\epsilon_m) \in \{\pm1\}$ denote the sign of $\sum_{j=1}^c\epsilon_{mj}(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j))$. Then, the following holds:

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j)\big)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m,\upsilon_j)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m,\upsilon_j)\Big)\Big].$

After taking expectation over υ, the rest of the proof proceeds the same way as the argument in Lemma 5:

$\mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m,\upsilon_j)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m,\upsilon_j)\Big)\Big]$
$\le \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\tfrac12\sup_{h\in H}\Big(U_{m-1}(h) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big) + \tfrac12\sup_{h\in H}\Big(U_{m-1}(h) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big)\Big]$
$= \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sigma_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big]\Big]$
$= \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big].$

Proceeding in the same way for all other $\sigma_i$'s ($i < m$) completes the proof.

A.3 General structured prediction learning bounds

In this section, we give the proof of several general structured prediction bounds in terms of the notion of factor graph Rademacher complexity. We will use the additive and multiplicative margin losses of a hypothesis h, which are the population versions of the empirical margin losses we introduced in (5) and (6) and are defined as follows:

$R^{\mathrm{add}}_\rho(h) = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big],$
$R^{\mathrm{mult}}_\rho(h) = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big)\Big].$

The following is our general margin bound for structured prediction.

Theorem 1. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}},$
$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

Proof. Let $\Phi_u(v) = \Phi^*(u - \frac v\rho)$, where $\Phi^*(r) = \min(M, \max(0, r))$. Observe that for any $u \in [0, M]$, $u\,1_{v\le0} \le \Phi_u(v)$ for all v. Therefore, by Lemma 4 and monotonicity of $\Phi^*$,

$R(h) \le \mathbb E_{(x,y)\sim D}\Big[\max_{y'\neq y}\Phi^*\Big(L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big] = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big] = R^{\mathrm{add}}_\rho(h).$

Define

$\mathcal H_0 = \Big\{(x,y)\mapsto\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac{h(x,y)-h(x,y')}{\rho}\Big)\colon h\in H\Big\},\qquad \mathcal H_1 = \Big\{(x,y)\mapsto\max_{y'\neq y}\Big(L(y', y) - \tfrac{h(x,y)-h(x,y')}{\rho}\Big)\colon h\in H\Big\}.$

By standard Rademacher complexity bounds [Koltchinskii and Panchenko, 2002], for any δ > 0, with probability at least 1 − δ, the following inequality holds for all $h \in H$:

$R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + 2\mathfrak R_m(\mathcal H_0) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

where $\mathfrak R_m(\mathcal H_0)$ is the Rademacher complexity of the family $\mathcal H_0$:

$\mathfrak R_m(\mathcal H_0) = \frac1m\,\mathbb E_{S\sim D^m}\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\,\Phi^*\Big(\max_{y'\neq y_i} L(y', y_i) - \tfrac{h(x_i,y_i)-h(x_i,y')}{\rho}\Big)\Big],$

and where $\sigma = (\sigma_1, \ldots, \sigma_m)$ with the $\sigma_i$'s independent Rademacher random variables uniformly distributed over $\{\pm1\}$. Since $\Phi^*$ is 1-Lipschitz, by Talagrand's contraction lemma [Ledoux and Talagrand, 1991, Mohri et al., 2012], we have $\widehat{\mathfrak R}_S(\mathcal H_0) \le \widehat{\mathfrak R}_S(\mathcal H_1)$. By taking an expectation over S, this inequality carries over to the true Rademacher complexities as well. Now, observe that by the sub-additivity of the supremum, the following holds:

$\widehat{\mathfrak R}_S(\mathcal H_1) \le \frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big] + \frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\frac{h(x_i, y_i)}{\rho}\Big],$

where we also used for the last term the fact that $-\sigma_i$ and $\sigma_i$ admit the same distribution. We use Lemma 5 to bound each of the two terms appearing on the right-hand side separately. To do so, we first show the Lipschitzness of $h \mapsto \max_{y'\neq y_i}\big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\big)$. Observe that the following chain of inequalities holds for any $h', h \in H$:

$\Big|\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h'(x_i, y')}{\rho}\Big) - \max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big| \le \frac1\rho\max_{y'\neq y_i}\big|h'(x_i, y') - h(x_i, y')\big| \le \frac1\rho\max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big|$
$= \frac1\rho\max_{y\in\mathcal Y}\Big|\sum_{f\in F_i}\big(h'_f(x_i, y_f) - h_f(x_i, y_f)\big)\Big| \le \frac1\rho\max_{y\in\mathcal Y}\sum_{f\in F_i}\big|h'_f(x_i, y_f) - h_f(x_i, y_f)\big| \le \frac1\rho\sum_{f\in F_i}\max_{y\in\mathcal Y_f}\big|h'_f(x_i, y) - h_f(x_i, y)\big|$
$\le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\max_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2} \le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2}.$

We can therefore apply Lemma 5, which yields

$\frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big] \le \frac{\sqrt2}{\rho m}\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}h_f(x_i, y)\Big] = \frac{\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H).$

Similarly, for the second term, observe that the following Lipschitz property holds:

$\frac1\rho\big|h'(x_i, y_i) - h(x_i, y_i)\big| \le \frac1\rho\max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big| \le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2}.$

We can therefore apply Lemma 5 and obtain the following:

$\frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\frac{h(x_i, y_i)}{\rho}\Big] \le \frac{\sqrt2}{\rho m}\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}h_f(x_i, y)\Big] = \frac{\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H).$

Taking the expectation over S of the two inequalities shows that $\mathfrak R_m(\mathcal H_1) \le \frac{2\sqrt2}{\rho}\,\mathfrak R^G_m(H)$, which completes the proof of the first statement. The second statement can be proven in a similar way with $\Phi_u(v) = \Phi^*(u(1 - \frac v\rho))$. In particular, by standard Rademacher complexity bounds, McDiarmid's inequality, and Talagrand's contraction lemma, we can write

$R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + 2\mathfrak R_m(\mathcal H_2) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

where $\mathcal H_2 = \big\{(x,y)\mapsto\max_{y'\neq y} L(y', y)\big(1 - \frac{h(x,y)-h(x,y')}{\rho}\big)\colon h\in H\big\}$. We observe that the following inequality holds:

$\Big|\max_{y'\neq y_i} L(y', y_i)\Big(1 - \frac{h'(x_i,y_i)-h'(x_i,y')}{\rho}\Big) - \max_{y'\neq y_i} L(y', y_i)\Big(1 - \frac{h(x_i,y_i)-h(x_i,y')}{\rho}\Big)\Big| \le \frac M\rho\Big(\big|h'(x_i, y_i) - h(x_i, y_i)\big| + \max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big|\Big).$

Then, the rest of the proof follows from Lemma 5 as in the previous argument.

In the proof above, we could have applied McDiarmid's inequality to bound the Rademacher complexity of $\mathcal H_0$ by its empirical counterpart at the cost of slightly increasing the exponential concentration term:

$R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + 2\widehat{\mathfrak R}_S(\mathcal H_0) + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Since Talagrand's contraction lemma holds for empirical Rademacher complexities and the remainder of the proof involves bounding the empirical Rademacher complexity of $\mathcal H_1$ before taking an expectation over the sample at the end, we can apply the same arguments without the final expectation to arrive at the following analogue of Theorem 1 in terms of empirical complexities:

Theorem 7. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H) + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\widehat{\mathfrak R}^G_S(H) + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

This theorem will be useful for many of our applications, which are based on bounding the empirical factor graph Rademacher complexity for different hypothesis classes.

A.4 Concentration of the empirical factor graph Rademacher complexity

In this section, we show that, as with the standard notion of Rademacher complexity, the empirical factor graph Rademacher complexity also concentrates around its mean.

Lemma 8. Let H be a family of scoring functions mapping $\mathcal X \times \mathcal Y \to \mathbb R$ bounded by a constant C. Let S be a training sample of size m drawn i.i.d. according to some distribution D on $\mathcal X \times \mathcal Y$, and let $D_X$ be the marginal distribution on $\mathcal X$. For any point $x \in \mathcal X$, let $F_x$ denote its associated set of factor nodes. Then, with probability at least 1 − δ over the draw of the sample $S \sim D^m$,

$\big|\widehat{\mathfrak R}^G_S(H) - \mathfrak R^G_m(H)\big| \le C\Big(\sup_{x\in\mathrm{supp}(D_X)}\sum_{f\in F_x}|\mathcal Y_f|\sqrt{|F_x|}\Big)\sqrt{\frac{2\log\frac2\delta}{m}}.$

Proof. Let $S = (x_1, x_2, \ldots, x_m)$ and $S' = (x'_1, x'_2, \ldots, x'_m)$ be two samples differing by one point, $x_j$ and $x'_j$ (i.e. $x_i = x'_i$ for $i \neq j$). Then

$\widehat{\mathfrak R}^G_S(H) - \widehat{\mathfrak R}^G_{S'}(H) \le \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\Big(\sum_{f\in F_{x_j}}\sum_{y\in\mathcal Y_f}\sqrt{|F_{x_j}|}\,\epsilon_{j,f,y}h_f(x_j, y) - \sum_{f\in F_{x'_j}}\sum_{y\in\mathcal Y_f}\sqrt{|F_{x'_j}|}\,\epsilon_{j,f,y}h_f(x'_j, y)\Big)\Big]$
$\le \frac{2C}m\sup_{x\in\mathrm{supp}(D_X)}\sum_{f\in F_x}\sum_{y\in\mathcal Y_f}\sqrt{|F_x|}.$

The same upper bound also holds for $\widehat{\mathfrak R}^G_{S'}(H) - \widehat{\mathfrak R}^G_S(H)$. The result now follows from McDiarmid's inequality.

A.5 Bounds on the factor graph Rademacher complexity

The following lemma is a standard bound on the expectation of the maximum of n zero-mean bounded random variables, which will be used in the proof of our bounds on the factor graph Rademacher complexity.

Lemma 9. Let $X_1, \ldots, X_n$ be n real-valued random variables such that for all $j \in [1, n]$, $X_j = \sum_{i=1}^{m_j} Y_{ij}$ where, for each fixed $j \in [1, n]$, the $Y_{ij}$ are independent zero-mean random variables with $|Y_{ij}| \le t_{ij}$. Then, the following inequality holds:

$\mathbb E\Big[\max_{j\in[1,n]} X_j\Big] \le t\sqrt{2\log n},$

with $t = \max_{j\in[1,n]}\sqrt{\sum_{i=1}^{m_j} t_{ij}^2}$.

The following are upper bounds on the factor graph Rademacher complexity for $H_1$ and $H_2$, as defined in Section 3. Similar guarantees can be given for other hypothesis sets $H_p$ with p > 2.

Theorem 2. For any sample $S = (x_1, \ldots, x_m)$, the following upper bounds hold for the empirical factor graph complexity of $H_1$ and $H_2$:

$\widehat{\mathfrak R}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m}\sqrt{s\,\log(2N)},\qquad \widehat{\mathfrak R}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

where $r_\infty = \max_{i,f,y}\|\Psi_f(x_i, y)\|_\infty$, $r_2 = \max_{i,f,y}\|\Psi_f(x_i, y)\|_2$ and where s is a sparsity factor defined by $s = \max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}$.

Proof. By definition of the dual norm and Lemma 9 (or Massart's lemma), the following holds:

$\widehat{\mathfrak R}^G_S(H_1) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_1\le\Lambda_1} w\cdot\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big] = \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_\infty\Big]$
$= \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\max_{j\in[1,N],\,\sigma\in\{-1,+1\}}\sigma\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_{f,j}(x_i, y)\Big]$
$= \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\max_{j\in[1,N],\,\sigma\in\{-1,+1\}}\sigma\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_{f,j}(x_i, y)\,1_{\Psi_{f,j}(x_i,y)\neq0}\Big]$
$\le \frac{\Lambda_1}m\sqrt{\Big(\max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}\Big)\,r_\infty^2\,\log(2N)} = \frac{\Lambda_1 r_\infty}m\sqrt{s\,\log(2N)},$

which completes the proof of the first statement. The second statement can be proven in a similar way using the definition of the dual norm and Jensen's inequality:

$\widehat{\mathfrak R}^G_S(H_2) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_2\le\Lambda_2} w\cdot\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big] = \frac{\Lambda_2}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_2\Big]$
$\le \frac{\Lambda_2}m\Big(\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_2^2\Big]\Big)^{1/2} = \frac{\Lambda_2}m\Big(\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,\|\Psi_f(x_i, y)\|_2^2\Big)^{1/2} \le \frac{\Lambda_2 r_2}m\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

which concludes the proof.

A.6 Learning guarantees for structured prediction with linear hypotheses

The following result is a direct consequence of Theorem 7 and Theorem 2.

Corollary 10. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_1$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho m}\sqrt{s\,\log(2N)} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M\Lambda_1 r_\infty}{\rho m}\sqrt{s\,\log(2N)} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Similarly, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M\Lambda_2 r_2}{\rho m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

A.7 Learning guarantees for multi-class classification with linear hypotheses

The following result is a direct consequence of Corollary 10 and the observation that for multi-class classification $|F_i| = 1$ and $d_i = \max_{f\in F_i}|\mathcal Y_f| = c$. Note that our multi-class learning guarantees hold for arbitrary bounded losses. To the best of our knowledge this is a novel result in this setting. In particular, these guarantees apply to the special case of the standard multi-class zero-one loss $L(y, y') = 1_{y'\neq y}$, which is bounded by M = 1.

Corollary 11. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_1$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho}\sqrt{\frac{c\log(2N)}{m}} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho}\sqrt{\frac{c\log(2N)}{m}} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Similarly, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho}\sqrt{\frac cm} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho}\sqrt{\frac cm} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Consider the following set of linear hypotheses:

$H_{2,1} = \{x\mapsto w\cdot\Psi(x, y)\colon \|w\|_{2,1}\le\Lambda_{2,1},\ y\in[c]\},$

where $\Psi(x, y) = (0, \ldots, 0, \Psi_y(x), 0, \ldots, 0)^\top \in \mathbb R^{N_1+\cdots+N_c}$ and $w = (w_1, \ldots, w_c)$ with $\|w\|_{2,1} = \sum_{y=1}^c\|w_y\|_2$. In this case, $w\cdot\Psi(x, y) = w_y\cdot\Psi_y(x)$. The standard scenario in multi-class classification is when $\Psi_y(x) = \Psi(x)$ is the same for all y.

Corollary 12. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_{2,1}$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{16\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{16\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$

where $r_{2,\infty} = \max_{i,y}\|\Psi_y(x_i)\|_2$.

Proof. By definition of the dual norm and $H_{2,1}$, the following holds:

$\widehat{\mathfrak R}^G_S(H_{2,1}) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_{2,1}\le\Lambda_{2,1}} w\cdot\sum_{i=1}^m\sum_{y=1}^c\epsilon_{i,y}\Psi(x_i, y)\Big] = \frac{\Lambda_{2,1}}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi(x_i, y)\Big\|_{2,\infty}\Big] = \frac{\Lambda_{2,1}}m\,\mathbb E_{\epsilon}\Big[\max_{y}\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi_y(x_i)\Big\|_2\Big]$
$\le \frac{\Lambda_{2,1}}m\Big(\mathbb E_{\epsilon}\Big[\max_{y}\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi_y(x_i)\Big\|_2^2\Big]\Big)^{1/2} = \frac{\Lambda_{2,1}}m\Big(\mathbb E_{\epsilon}\Big[\max_{y}\Big(\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 + \sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big)\Big]\Big)^{1/2}$
$\le \frac{\Lambda_{2,1}}m\Big(\max_{y}\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 + \mathbb E_{\epsilon}\Big[\max_{y}\sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big]\Big)^{1/2}.$

By Lemma 9 (or Massart's lemma), the following bound holds:

$\mathbb E_{\epsilon}\Big[\max_{y}\sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big] \le m\,r_{2,\infty}^2\sqrt{2\log c}.$

Since $\max_y\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 \le m\,r_{2,\infty}^2$, we obtain that the following result holds:

$\widehat{\mathfrak R}^G_S(H_{2,1}) \le \frac{\Lambda_{2,1} r_{2,\infty}\big(1+\sqrt{2\log c}\big)^{1/2}}{\sqrt m} \le \frac{2\sqrt2\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\sqrt m},$

and applying Theorem 1 completes the proof.

A.8 VRM structured prediction learning bounds

Here, we give the proof of our structured prediction learning guarantees in the setting of Voted Risk Minimization. We will use the following lemma.

Lemma 13. The function $\Phi^*$ is sub-additive: $\Phi^*(x + y) \le \Phi^*(x) + \Phi^*(y)$, for all $x, y \in \mathbb R$.

Proof. By the sub-additivity of the maximum function, for any $x, y \in \mathbb R$, the following upper bound holds for $\Phi^*(x+y)$:

$\Phi^*(x+y) = \min(M, \max(0, x+y)) \le \min(M, \max(0, x) + \max(0, y)) \le \min(M, \max(0, x)) + \min(M, \max(0, y)) = \Phi^*(x) + \Phi^*(y),$

which completes the proof.

For the following proof, for any $\tau \ge 0$, the margin losses $R^{\mathrm{add}}_{\rho,\tau}(h)$ and $R^{\mathrm{mult}}_{\rho,\tau}(h)$ are defined as the population counterparts of the empirical losses defined by (7) and (8).

Theorem 3. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, each of the following inequalities holds for all $f \in F$:

$R(f) \le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$
$R(f) \le \widehat R^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt2\,M}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

where $C(\rho, M, c, m, p) = \frac{2M}{\rho}\sqrt{\frac{\log p}{m}} + 3M\sqrt{\Big\lceil\frac{4}{\rho^2}\log\Big(\frac{c^2\rho^2 m}{4\log p}\Big)\Big\rceil\frac{\log p}{m} + \frac{\log\frac2\delta}{2m}}$.

Proof. The proof makes use of Theorem 1 and the proof techniques of Kuznetsov et al. [2014] (Theorem 1), but requires a finer analysis, both because of the general loss functions used here and because of the more complex structure of the hypothesis set.

For a fixed $h = (h_1, \ldots, h_T)$, any α in the probability simplex defines a distribution over $\{h_1, \ldots, h_T\}$. Sampling from $\{h_1, \ldots, h_T\}$ according to α and averaging leads to functions g of the form $g = \frac1n\sum_{t=1}^T n_t h_t$ for some $n = (n_1, \ldots, n_T) \in \mathbb N^T$, with $\sum_{t=1}^T n_t = n$, and $h_t \in H_{k_t}$. For any $N = (N_1, \ldots, N_p)$ with $|N| = n$, we consider the family of functions

$G_{F,N} = \Big\{\frac1n\sum_{k=1}^p\sum_{j=1}^{N_k} h_{k,j}\colon \forall(k, j)\in[p]\times[N_k],\ h_{k,j}\in H_k\Big\},$

and the union of all such families $G_{F,n} = \cup_{|N|=n} G_{F,N}$. Fix ρ > 0. For a fixed N, the empirical factor graph Rademacher complexity of $G_{F,N}$ can be bounded as follows for any $m \ge 1$:

$\widehat{\mathfrak R}^G_S(G_{F,N}) \le \frac1n\sum_{k=1}^p N_k\,\widehat{\mathfrak R}^G_S(H_k),$

which also implies the result for the true factor graph Rademacher complexities. Thus, by Theorem 1, the following learning bound holds: for any δ > 0, with probability at least 1 − δ, for all $g \in G_{F,N}$,

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p N_k\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

Since there are at most $p^n$ possible p-tuples N with $|N| = n$, by the union bound, for any δ > 0, with probability at least 1 − δ, for all $g \in G_{F,n}$, we can write

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p N_k\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Thus, with probability at least 1 − δ, for all functions $g = \frac1n\sum_{t=1}^T n_t h_t$ with $h_t \in H_{k_t}$, the following inequality holds:

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p\sum_{t\colon k_t=k} n_t\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Taking the expectation with respect to α and using $\mathbb E_\alpha[n_t/n] = \alpha_t$, we obtain that for any δ > 0, with probability at least 1 − δ, for all g, we can write

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Fix $n \ge 1$. Then, for any $\delta_n > 0$, with probability at least $1 - \delta_n$,

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{p^n}{\delta_n}}{2m}}.$

The number S(p, n) of p-tuples N with $|N| = n$ is known to be precisely $\binom{p+n-1}{p-1}$.

Choose $\delta_n = \frac{\delta}{2p^{n-1}}$ for some δ > 0; then, for $p \ge 2$, $\sum_{n\ge1}\delta_n = \frac{\delta}{2(1 - 1/p)} \le \delta$. Thus, for any δ > 0 and any $n \ge 1$, with probability at least 1 − δ, the following holds for all g:

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{2p^n}\delta}{2m}}.$   (15)

Now, for any $f = \sum_{t=1}^T\alpha_t h_t \in F$ and any $g = \frac1n\sum_{t=1}^T n_t h_t$, using (4), we can upper bound R(f), the generalization error of f, as follows:

$R(f) = \mathbb E_{(x,y)\sim D}\big[L(\mathsf f(x), y)\,1_{\rho_f(x,y)\le0}\big]$
$\le \mathbb E\big[L(\mathsf f(x), y)\,1_{\rho_f(x,y) - (g(x,y) - g(x,y_f)) < -1/2}\big] + \mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big]$
$\le M\,\Pr\big[\rho_f(x,y) - (g(x,y) - g(x,y_f)) < -1/2\big] + \mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big],$

where for any function $\varphi\colon \mathcal X\times\mathcal Y\to\mathbb R$, we define $y_\varphi$ as follows: $y_\varphi = \operatorname{argmax}_{y'\neq y}\varphi(x, y')$. Using the same arguments as in the proof of Lemma 4, one can show that

$\mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big] \le R^{\mathrm{add}}_{\rho,1/2}(g).$

We now give a lower bound on $\widehat R^{\mathrm{add}}_{S,\rho,1}(f)$ in terms of $\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)$. To do so, we start with the expression of $\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)$:

$\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) + \frac{\tfrac12 + g(x,y') - g(x,y)}{\rho}\Big)\Big].$

By the sub-additivity of max, we can write

$\max_{y'\neq y}\Big(L(y', y) + \frac{\tfrac12 + g(x,y') - g(x,y)}{\rho}\Big) \le \max_{y'\neq y}\Big(L(y', y) + \frac{1 + f(x,y') - f(x,y)}{\rho}\Big) + \max_{y'\neq y}\Big(\frac{f(x,y) - f(x,y') - \tfrac12 + g(x,y') - g(x,y)}{\rho}\Big) = X + Y,$

where X and Y are defined by

$X = \max_{y'\neq y}\Big(L(y', y) + \frac{1 + f(x,y') - f(x,y)}{\rho}\Big),\qquad Y = \max_{y'\neq y}\Big(\frac{f(x,y) - f(x,y') - \tfrac12 + g(x,y') - g(x,y)}{\rho}\Big).$

In view of that, since $\Phi^*$ is non-decreasing and sub-additive (Lemma 13), we can write

$\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \mathbb E_{(x,y)\sim S}\big[\Phi^*(X + Y)\big] \le \mathbb E_{(x,y)\sim S}\big[\Phi^*(X)\big] + \mathbb E_{(x,y)\sim S}\big[\Phi^*(Y)\big] = \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \mathbb E_{(x,y)\sim S}\big[\Phi^*(Y)\big]$
$\le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + M\,\mathbb E_{(x,y)\sim S}\big[1_{Y>0}\big] = \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + M\,\Pr_{(x,y)\sim S}\Big[\max_{y'\neq y}\big\{(f(x,y) - g(x,y)) + (g(x,y') - f(x,y'))\big\} > \tfrac12\Big].$


Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Transductive Learning of Structural SVMs via Prior Knowledge Constraints

Transductive Learning of Structural SVMs via Prior Knowledge Constraints Transductive Learning of Structural SVMs via Prior Knowledge Constraints Chun-Na Yu Dept. of Coputing Science & AICML University of Alberta, AB, Canada chunna@cs.ualberta.ca Abstract Reducing the nuber

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Learning with Deep Cascades

Learning with Deep Cascades Learning with Deep Cascades Giulia DeSalvo 1, Mehryar Mohri 1,2, and Uar Syed 2 1 Courant Institute of Matheatical Sciences, 251 Mercer Street, New Yor, NY 10012 2 Google Research, 111 8th Avenue, New

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Deep Boosting. Abstract. 1. Introduction

Deep Boosting. Abstract. 1. Introduction Corinna Cortes Google Research, 8th Avenue, New York, NY Mehryar Mohri Courant Institute and Google Research, 25 Mercer Street, New York, NY 2 Uar Syed Google Research, 8th Avenue, New York, NY Abstract

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

arxiv: v4 [cs.lg] 4 Apr 2016

arxiv: v4 [cs.lg] 4 Apr 2016 e-publication 3 3-5 Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions arxiv:35796v4 cslg 4 Apr 6 Corinna Cortes Google Research, 76 Ninth Avenue, New York, NY Spencer

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

E. Alpaydın AERFAISS

E. Alpaydın AERFAISS E. Alpaydın AERFAISS 00 Introduction Questions: Is the error rate of y classifier less than %? Is k-nn ore accurate than MLP? Does having PCA before iprove accuracy? Which kernel leads to highest accuracy

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Deep Boosting. Abstract. 1. Introduction

Deep Boosting. Abstract. 1. Introduction Corinna Cortes Google Research, 8th Avenue, New York, NY 00 Mehryar Mohri Courant Institute and Google Research, 5 Mercer Street, New York, NY 00 Uar Syed Google Research, 8th Avenue, New York, NY 00 Abstract

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Generalization Bounds for Learning Weighted Automata

Generalization Bounds for Learning Weighted Automata Generalization Bounds for Learning Weighted Autoata Borja Balle a,, Mehryar Mohri b,c a Departent of Matheatics and Statistics, Lancaster University, Lancaster, UK b Courant Institute of Matheatical Sciences,

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval Unifor Approxiation and Bernstein Polynoials with Coefficients in the Unit Interval Weiang Qian and Marc D. Riedel Electrical and Coputer Engineering, University of Minnesota 200 Union St. S.E. Minneapolis,

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial Time

Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial Time Learning Maxiu-A-Posteriori Perturbation Models for Structured Prediction in Polynoial Tie Asish Ghoshal Jean Honorio Abstract MAP perturbation odels have eerged as a powerful fraework for inference in

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

SimpleMKL. hal , version 1-26 Jan Abstract. Alain Rakotomamonjy LITIS EA 4108 Université de Rouen Saint Etienne du Rouvray, France

SimpleMKL. hal , version 1-26 Jan Abstract. Alain Rakotomamonjy LITIS EA 4108 Université de Rouen Saint Etienne du Rouvray, France SipleMKL Alain Rakotoaonjy LITIS EA 48 Université de Rouen 768 Saint Etienne du Rouvray, France Francis Bach INRIA - Willow project Départeent d Inforatique, Ecole Norale Supérieure 45, Rue d Ul 7523 Paris,

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information