Structured Prediction Theory Based on Factor Graph Complexity


Corinna Cortes, Google Research, New York, NY
Vitaly Kuznetsov, Google Research, New York, NY
Mehryar Mohri, Courant Institute and Google, New York, NY
Scott Yang, Courant Institute, New York, NY

Abstract

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs. Our proof techniques include generalizations of Talagrand's contraction lemma that can be of independent interest. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. We also report the results of experiments with VCRF on several datasets to validate our theory.

1 Introduction

Structured prediction covers a broad family of important learning problems. These include key tasks in natural language processing such as part-of-speech tagging, parsing, machine translation, and named-entity recognition, important areas in computer vision such as image segmentation and object recognition, and also crucial areas in speech processing such as pronunciation modeling and speech recognition. In all these problems, the output space admits some structure. This may be a sequence of tags as in part-of-speech tagging, a parse tree as in context-free parsing, an acyclic graph as in dependency parsing, or labels of image segments as in object detection. Another property common to these tasks is that, in each case, the natural loss function admits a decomposition along the output substructures. As an example, the loss function may be the Hamming loss as in part-of-speech tagging, or it may be the edit-distance, which is widely used in natural language and speech processing.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The output structure and corresponding loss function make these problems significantly different from the (unstructured) binary classification problems extensively studied in learning theory. In recent years, a number of different algorithms have been designed for structured prediction, including Conditional Random Field (CRF) [Lafferty et al., 2001], StructSVM [Tsochantaridis et al., 2005], Maximum-Margin Markov Network (M3N) [Taskar et al., 2003], a kernel-regression algorithm [Cortes et al., 2007], and search-based approaches such as [Daumé III et al., 2009, Doppa et al., 2014, Lam et al., 2015, Chang et al., 2015, Ross et al., 2011]. More recently, deep learning techniques have also been developed for tasks including part-of-speech tagging [Jurafsky and Martin, 2009, Vinyals et al., 2015a], named-entity recognition [Nadeau and Sekine, 2007], machine translation [Zhang et al., 2008], image segmentation [Lucchi et al., 2013], and image annotation [Vinyals et al., 2015b].

However, in contrast to the plethora of algorithms, there have been relatively few studies devoted to the theoretical understanding of structured prediction [Bakir et al., 2007]. Existing learning guarantees hold primarily for simple losses such as the Hamming loss [Taskar et al., 2003, Cortes et al., 2014, Collins, 2001] and do not cover other natural losses such as the edit-distance. They also typically only apply to specific factor graph models. The main exception is the work of McAllester [2007], which provides PAC-Bayesian guarantees for arbitrary losses, though only in the special case of randomized algorithms using linear (count-based) hypotheses.

This paper presents a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a broad family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. For special cases studied in the past, our learning bounds match or improve upon the previously best bounds (see Section 3.3). In particular, our bounds improve upon those of Taskar et al. [2003]. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs.

We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. As a proof of concept validating our theory, we also report the results of experiments with VCRF on several datasets.

The paper is organized as follows. In Section 2 we introduce the notation and definitions relevant to our discussion of structured prediction. In Section 3, we derive a series of new learning guarantees for structured prediction, which are then used to prove the VRM principle in Section 4. Section 5 develops the algorithmic framework which is directly based on our theory. In Section 6, we provide some preliminary experimental results that serve as a proof of concept for our theory.

2 Preliminaries

Let X denote the input space and Y the output space. In structured prediction, the output space may
be a set of sequences, images, graphs, parse trees, lists, or some other (typically discrete) objects admitting some possibly overlapping structure. Thus, we assume that the output structure can be decomposed into l substructures. For example, this may be positions along a sequence, so that the output space Y is decomposable along these substructures: $\mathcal Y = \mathcal Y_1 \times \cdots \times \mathcal Y_l$. Here, $\mathcal Y_k$ is the set of possible labels (or classes) that can be assigned to substructure k.

Loss functions. We denote by $L\colon \mathcal Y \times \mathcal Y \to \mathbb R_+$ a loss function measuring the dissimilarity of two elements of the output space Y. We will assume that the loss function L is definite, that is $L(y, y') = 0$ iff $y = y'$. This assumption holds for all loss functions commonly used in structured prediction. A key aspect of structured prediction is that the loss function can be decomposed along the substructures $\mathcal Y_k$. As an example, L may be the Hamming loss defined by $L(y, y') = \frac{1}{l}\sum_{k=1}^{l} 1_{y_k \neq y'_k}$ for all $y = (y_1, \ldots, y_l)$ and $y' = (y'_1, \ldots, y'_l)$, with $y_k, y'_k \in \mathcal Y_k$. In the common case where Y is a set of sequences defined over a finite alphabet, L may be the edit-distance, which is widely used in natural language and speech processing applications, with possibly different costs associated to insertions, deletions and substitutions. L may also be a loss based on the negative inner product of the vectors of n-gram counts of two sequences, or its negative logarithm. Such losses have been

used to approximate the BLEU score loss in machine translation. There are other losses defined in computational biology based on various string-similarity measures. Our theoretical analysis is general and applies to arbitrary bounded and definite loss functions.

[Figure 1: Example of factor graphs. (a) Pairwise Markov network decomposition: $h(x,y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$; (b) Other decomposition: $h(x,y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.]

Scoring functions and factor graphs. We will adopt the common approach in structured prediction where predictions are based on a scoring function mapping $\mathcal X \times \mathcal Y$ to $\mathbb R$. Let H be a family of scoring functions. For any $h \in H$, we denote by $\mathsf h$ the predictor defined by h: for any $x \in \mathcal X$, $\mathsf h(x) = \operatorname{argmax}_{y \in \mathcal Y} h(x, y)$. Furthermore, we will assume, as is standard in structured prediction, that each function $h \in H$ can be decomposed as a sum. We will consider the most general case for such decompositions, which can be made explicit using the notion of factor graphs. A factor graph G is a tuple G = (V, F, E), where V is a set of variable nodes, F a set of factor nodes, and E a set of undirected edges between a variable node and a factor node. In our context, V can be identified with the set of substructure indices, that is V = {1, ..., l}. For any factor node f, denote by $N(f) \subseteq V$ the set of variable nodes connected to f via an edge and define $\mathcal Y_f$ as the substructure set cross-product $\mathcal Y_f = \prod_{k \in N(f)} \mathcal Y_k$. Then, h admits the following decomposition as a sum of functions $h_f$, each taking as argument an element of the input space $x \in \mathcal X$ and an element of $\mathcal Y_f$, $y_f \in \mathcal Y_f$:

$h(x, y) = \sum_{f \in F} h_f(x, y_f).$   (1)

Figure 1 illustrates this definition with two different decompositions. More generally, we will consider the setting in which a factor graph may depend on a particular example $(x_i, y_i)$: $G(x_i, y_i) = G_i = ([l_i], F_i, E_i)$. A special case of this setting is for example when the size $l_i$ (or length) of each example is allowed to vary and where the number of possible labels $|\mathcal Y|$ is potentially infinite. We present other examples of such hypothesis sets and their decomposition in Section 3, where we discuss our learning guarantees. Note that such hypothesis sets H with an additive decomposition are those commonly used in most structured prediction algorithms [Tsochantaridis et al., 2005, Taskar et al., 2003, Lafferty et al., 2001]. This is largely motivated by the computational requirement for efficient training and inference. Our results, while very general, further provide a statistical learning motivation for such decompositions.

Learning scenario. We consider the familiar supervised learning scenario where the training and test points are drawn i.i.d. according to some distribution D over $\mathcal X \times \mathcal Y$. We will further adopt the standard definitions of margin, generalization error and empirical error. The margin $\rho_h(x, y)$ of a hypothesis h for a labeled example $(x, y) \in \mathcal X \times \mathcal Y$ is defined by

$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y').$   (2)

Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training sample of size m drawn from $D^m$. We denote by R(h) the generalization error and by $\widehat R_S(h)$ the empirical error of h over S:

$R(h) = \mathbb E_{(x,y)\sim D}\big[L(\mathsf h(x), y)\big] \quad\text{and}\quad \widehat R_S(h) = \mathbb E_{(x,y)\sim S}\big[L(\mathsf h(x), y)\big],$   (3)

where $\mathsf h(x) = \operatorname{argmax}_y h(x, y)$ and where the notation $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by S.

Factor graphs are typically used to indicate the factorization of a probabilistic model. We are not assuming probabilistic models, but they would also be captured by our general framework: h would then be the negative log of a probability.
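To make the decomposition in Eq. (1) concrete, here is a minimal sketch (not from the paper; all function and variable names are illustrative) of a chain-structured scorer $h(x, y) = \sum_f h_f(x, y_f)$, the normalized Hamming loss, and brute-force argmax inference over a small output space.

```python
import itertools
from typing import Callable, Sequence, Tuple

# Factors of a chain model: factor f_k depends on positions (k, k+1),
# so Y_f is the cross-product of the two label sets (here both [0, num_labels)).
def chain_factors(l: int) -> Sequence[Tuple[int, int]]:
    return [(k, k + 1) for k in range(l - 1)]

def score(h_f: Callable, x, y: Sequence[int], factors) -> float:
    """h(x, y) = sum_f h_f(x, y_f), cf. Eq. (1)."""
    return sum(h_f(x, f, tuple(y[k] for k in f)) for f in factors)

def hamming_loss(y: Sequence[int], y_prime: Sequence[int]) -> float:
    """Normalized Hamming loss: (1/l) * sum_k 1[y_k != y'_k]."""
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def predict(h_f, x, l: int, num_labels: int, factors) -> Tuple[int, ...]:
    """Brute-force argmax over Y = Y_1 x ... x Y_l (exponential; for tiny l only)."""
    return max(itertools.product(range(num_labels), repeat=l),
               key=lambda y: score(h_f, x, y, factors))

if __name__ == "__main__":
    # Toy pairwise scorer: rewards matching x at each position and equal consecutive labels.
    def h_f(x, f, y_f):
        k, k1 = f
        return float(y_f[0] == x[k]) + float(y_f[1] == x[k1]) + 0.5 * float(y_f[0] == y_f[1])

    x = [0, 0, 1, 1]
    factors = chain_factors(len(x))
    y_hat = predict(h_f, x, l=len(x), num_labels=2, factors=factors)
    print(y_hat, hamming_loss(y_hat, [0, 0, 1, 1]))
```

In practice, the argmax in the predictor is computed with dynamic programming (e.g., Viterbi for chains) rather than enumeration; the brute-force version above is only meant to mirror the definitions.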

The learning problem consists of using the sample S to select a hypothesis $h \in H$ with small expected loss R(h). Observe that the definiteness of the loss function implies, for all $x \in \mathcal X$, the following equality:

$L(\mathsf h(x), y) = L(\mathsf h(x), y)\,1_{\rho_h(x,y) \le 0}.$   (4)

We will later use this identity in the derivation of surrogate loss functions.

3 General learning bounds for structured prediction

In this section, we present new learning guarantees for structured prediction. Our analysis is general and applies to the broad family of definite and bounded loss functions described in the previous section. It is also general in the sense that it applies to general hypothesis sets and not just sub-families of linear functions. For linear hypotheses, we will give a more refined analysis that holds for arbitrary norm-p regularized hypothesis sets.

The theoretical analysis of structured prediction is more complex than for classification since, by definition, it depends on the properties of the loss function and the factor graph. These attributes capture the combinatorial properties of the problem which must be exploited since the total number of labels is often exponential in the size of that graph. To tackle this problem, we first introduce a new complexity tool.

3.1 Complexity measure

A key ingredient of our analysis is a new data-dependent notion of complexity that extends the classical Rademacher complexity. We define the empirical factor graph Rademacher complexity $\widehat{\mathfrak R}^G_S(H)$ of a hypothesis set H for a sample $S = (x_1, \ldots, x_m)$ and factor graph G as follows:

$\widehat{\mathfrak R}^G_S(H) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{h \in H}\sum_{i=1}^{m}\sum_{f \in F_i}\sum_{y \in \mathcal Y_f}\sqrt{|F_i|}\;\epsilon_{i,f,y}\,h_f(x_i, y)\Big],$

where $\epsilon = (\epsilon_{i,f,y})_{i \in [m], f \in F_i, y \in \mathcal Y_f}$ and the $\epsilon_{i,f,y}$'s are independent Rademacher random variables uniformly distributed over $\{\pm1\}$. The factor graph Rademacher complexity of H for a factor graph G is defined as the expectation: $\mathfrak R^G_m(H) = \mathbb E_{S \sim D^m}\big[\widehat{\mathfrak R}^G_S(H)\big]$. It can be shown that the empirical factor graph Rademacher complexity is concentrated around its mean (Lemma 8). The factor graph Rademacher complexity is a natural extension of the standard Rademacher complexity to vector-valued hypothesis sets (with one coordinate per factor in our case). For binary classification, the factor graph and standard Rademacher complexities coincide. Otherwise, the factor graph complexity can be upper bounded in terms of the standard one. As with the standard Rademacher complexity, the factor graph Rademacher complexity of a hypothesis set can be estimated from data in many cases. In some important cases, it also admits explicit upper bounds similar to those for the standard Rademacher complexity but with an additional dependence on the factor graph quantities. We will prove this for several families of functions which are commonly used in structured prediction (Theorem 2).

3.2 Generalization bounds

In this section, we present new margin bounds for structured prediction based on the factor graph Rademacher complexity of H. Our results hold both for the additive and the multiplicative empirical margin losses defined below:

$\widehat R^{\mathrm{add}}_{S,\rho}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big]$   (5)

$\widehat R^{\mathrm{mult}}_{S,\rho}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big)\Big]$   (6)

Here, $\Phi^*(r) = \min(M, \max(0, r))$ for all r, with $M = \max_{y,y'} L(y, y')$. As we show in Section 5, convex upper bounds on $\widehat R^{\mathrm{add}}_{S,\rho}(h)$ and $\widehat R^{\mathrm{mult}}_{S,\rho}(h)$ directly lead to many existing structured prediction algorithms.
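As a small illustration (not part of the paper), the empirical margin losses (5) and (6) can be evaluated directly when the output space is small enough to enumerate; `scores[y]` plays the role of h(x, y), `loss[y]` of L(y, y_true), and Φ* and M are as defined above. All names are illustrative.

```python
from typing import Dict, Hashable

def phi_star(r: float, M: float) -> float:
    """Phi*(r) = min(M, max(0, r)), with M = max_{y,y'} L(y, y')."""
    return min(M, max(0.0, r))

def additive_margin_loss(scores: Dict[Hashable, float], loss: Dict[Hashable, float],
                         y_true: Hashable, rho: float, M: float) -> float:
    """One term of Eq. (5): Phi*( max_{y' != y} L(y', y) - (h(x,y) - h(x,y')) / rho )."""
    inner = max(loss[y] - (scores[y_true] - scores[y]) / rho
                for y in scores if y != y_true)
    return phi_star(inner, M)

def multiplicative_margin_loss(scores, loss, y_true, rho: float, M: float) -> float:
    """One term of Eq. (6): Phi*( max_{y' != y} L(y', y) * (1 - (h(x,y) - h(x,y')) / rho) )."""
    inner = max(loss[y] * (1.0 - (scores[y_true] - scores[y]) / rho)
                for y in scores if y != y_true)
    return phi_star(inner, M)

if __name__ == "__main__":
    scores = {"a": 2.0, "b": 1.0, "c": -0.5}   # h(x, y) for each candidate output
    loss = {"a": 0.0, "b": 0.4, "c": 1.0}      # L(y, y_true) with y_true = "a"
    M = max(loss.values())
    print(additive_margin_loss(scores, loss, "a", rho=1.0, M=M))
    print(multiplicative_margin_loss(scores, loss, "a", rho=1.0, M=M))
```

Averaging these per-example values over the sample gives the empirical quantities appearing in the bounds below.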

The following is our general data-dependent margin bound for structured prediction.

Theorem 1. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

The full proof of Theorem 1 is given in Appendix A. It is based on a new contraction lemma (Lemma 5) generalizing Talagrand's lemma that can be of independent interest. We also present a more refined contraction lemma (Lemma 6) that can be used to improve the bounds of Theorem 1. Theorem 1 is the first data-dependent generalization guarantee for structured prediction with general loss functions, general hypothesis sets, and arbitrary factor graphs for both multiplicative and additive margins. We also present a version of this result with empirical complexities as Theorem 7 in the supplementary material. We will compare these guarantees to known special cases below.

The margin bounds above can be extended to hold uniformly over $\rho \in (0, 1]$, at the price of an additional term of the form $\sqrt{\log\log_2\frac2\rho/m}$ in the bound, using known techniques (see for example [Mohri et al., 2012]).

The hypothesis set used by convex structured prediction algorithms such as StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003] or Conditional Random Field (CRF) [Lafferty et al., 2001] is that of linear functions. More precisely, let Ψ be a feature mapping from $\mathcal X \times \mathcal Y$ to $\mathbb R^N$ such that $\Psi(x, y) = \sum_{f \in F}\Psi_f(x, y_f)$. For any p, define $H_p$ as follows:

$H_p = \{x \mapsto w\cdot\Psi(x, y)\colon w \in \mathbb R^N,\ \|w\|_p \le \Lambda_p\}.$

Then, $\widehat{\mathfrak R}^G_S(H_p)$ can be efficiently estimated using random sampling and solving LP programs. Moreover, one can obtain explicit upper bounds on $\mathfrak R^G_m(H_p)$. To simplify our presentation, we will consider the case p = 1, 2, but our results can be extended to arbitrary $p \ge 1$ and, more generally, to arbitrary group norms.

Theorem 2. For any sample $S = (x_1, \ldots, x_m)$, the following upper bounds hold for the empirical factor graph complexity of $H_1$ and $H_2$:

$\widehat{\mathfrak R}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m}\sqrt{s\,\log(2N)},\qquad \widehat{\mathfrak R}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

where $r_\infty = \max_{i,f,y}\|\Psi_f(x_i, y)\|_\infty$, $r_2 = \max_{i,f,y}\|\Psi_f(x_i, y)\|_2$ and where s is a sparsity factor defined by $s = \max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}$.

Plugging these factor graph complexity upper bounds into Theorem 1 immediately yields explicit data-dependent structured prediction learning guarantees for linear hypotheses with general loss functions and arbitrary factor graphs (see Corollary 10). Observe that, in the worst case, the sparsity factor can be bounded as follows:

$s \le \sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i| \le \sum_{i=1}^m|F_i|^2 d_i \le m\max_i|F_i|^2 d_i,$

where $d_i = \max_{f\in F_i}|\mathcal Y_f|$. Thus, the factor graph Rademacher complexities of linear hypotheses in $H_1$ scale as $O\big(\sqrt{\log(N)\max_i|F_i|^2 d_i/m}\big)$. An important observation is that $|F_i|$ and $d_i$ depend on the observed sample. This shows that the expected size of the factor graph is crucial for learning in this scenario. This should be contrasted with other existing structured prediction guarantees that we discuss below, which assume a fixed upper bound on the size of the factor graph. Note that our result shows that learning is possible even with an infinite set Y. To the best of our knowledge, this is the first learning guarantee for learning with infinitely many classes.

A result similar to Lemma 5 has also been recently proven independently in [Maurer, 2016].
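For linear hypotheses the inner supremum can be evaluated in closed form by duality (for $H_2$, $\sup_{\|w\|_2\le\Lambda_2} w\cdot v = \Lambda_2\|v\|_2$), so the empirical factor graph Rademacher complexity can be estimated by sampling Rademacher variables, and the Theorem 2 quantities $r_\infty$, $r_2$ and s can be computed in one pass over the features. The following sketch assumes feature maps are available as dense NumPy vectors $\Psi_f(x_i, y)$ indexed by (i, f, y); it is an illustration, not the authors' code.

```python
import numpy as np

# features[i] is a list of factors; each factor is a list of feature vectors,
# one vector Psi_f(x_i, y) of dimension N per assignment y in Y_f.
def fg_rademacher_H2(features, Lambda2: float, num_samples: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of (Lambda2/m) * E_eps || sum_{i,f,y} sqrt(|F_i|) eps_{i,f,y} Psi_f(x_i,y) ||_2."""
    rng = np.random.default_rng(seed)
    m = len(features)
    N = features[0][0][0].shape[0]
    vals = []
    for _ in range(num_samples):
        v = np.zeros(N)
        for factors in features:
            w_i = np.sqrt(len(factors))                       # sqrt(|F_i|)
            for factor in factors:
                for psi in factor:                            # one psi per y in Y_f
                    v += w_i * rng.choice([-1.0, 1.0]) * psi  # eps_{i,f,y} * Psi_f(x_i, y)
        vals.append(np.linalg.norm(v))
    return Lambda2 * float(np.mean(vals)) / m

def theorem2_quantities(features):
    """r_inf, r_2 and the sparsity factor s appearing in Theorem 2."""
    r_inf = max(np.max(np.abs(psi)) for fs in features for factor in fs for psi in factor)
    r_2 = max(np.linalg.norm(psi) for fs in features for factor in fs for psi in factor)
    N = features[0][0][0].shape[0]
    s_j = np.zeros(N)
    for factors in features:
        for factor in factors:
            for psi in factor:
                s_j += len(factors) * (psi != 0)              # |F_i| * 1[Psi_{f,j}(x_i,y) != 0]
    return float(r_inf), float(r_2), float(np.max(s_j))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 5 examples, 3 factors each, 4 assignments per factor, N = 8 features.
    feats = [[[rng.standard_normal(8) for _ in range(4)] for _ in range(3)] for _ in range(5)]
    print(fg_rademacher_H2(feats, Lambda2=1.0))
    print(theorem2_quantities(feats))
```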

Our learning guarantee for $H_1$ can additionally benefit from the sparsity of the feature mapping and observed data. In particular, in many applications, $\Psi_{f,j}$ is a binary indicator function that is non-zero for a single $(x, y) \in \mathcal X \times \mathcal Y_f$. For instance, in NLP, $\Psi_{f,j}$ may indicate an occurrence of a certain n-gram in the input $x_i$ and output $y_i$. In this case, $s = \sum_i|F_i|^2 \le m\max_i|F_i|^2$ and the complexity term is only in $O\big(\max_i|F_i|\sqrt{\log(N)/m}\big)$, where N may depend linearly on $d_i$.

3.3 Special cases and comparisons

Markov networks. For the pairwise Markov networks with a fixed number of substructures l studied by Taskar et al. [2003], our equivalent factor graph admits l nodes, $|F_i| = l$, and the maximum size of $\mathcal Y_f$ is $d_i = k^2$ if each substructure of a pair can be assigned one of k classes. Thus, if we apply Corollary 10 with the Hamming distance as our loss function and divide the bound through by l, to normalize the loss to the interval [0, 1] as in [Taskar et al., 2003], we obtain the following explicit form of our guarantee for an additive empirical margin loss, for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2 k}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

This bound can be further improved by eliminating the dependency on k using an extension of our contraction Lemma 5 to $\|\cdot\|_{\infty,2}$ (see Lemma 6). The complexity term of Taskar et al. [2003] is bounded by a quantity that varies as $\tilde O\big(\Lambda_2 q r_2/\sqrt m\big)$, where q is the maximal out-degree of a factor graph. Our bound has the same dependence on these key quantities, but with no logarithmic term in our case. Note that, unlike the result of Taskar et al. [2003], our bound also holds for general loss functions and different p-norm regularizers. Moreover, our result for a multiplicative empirical margin loss is new, even in this special case.

Multi-class classification. For standard (unstructured) multi-class classification, we have $|F_i| = 1$ and $d_i = c$, where c is the number of classes. In that case, for linear hypotheses with norm-2 regularization, the complexity term of our bound varies as $O\big(\Lambda_2 r_2\sqrt{c/m}\big)$ (Corollary 11). This improves upon the best known general margin bounds of Kuznetsov et al. [2014], who provide a guarantee that scales linearly with the number of classes instead. Moreover, in the special case where an individual $w_y$ is learned for each class $y \in [c]$, we retrieve the recent favorable bounds given by Lei et al. [2015], albeit with a somewhat simpler formulation. In that case, for any (x, y), all components of the feature vector Ψ(x, y) are zero, except (perhaps) for the N components corresponding to class y, where N is the dimension of $w_y$. In view of that, for example for a group-norm-2,1 regularization, the complexity term of our bound varies as $O\big(\Lambda r\sqrt{\log c}/\sqrt m\big)$, which matches the results of Lei et al. [2015] with a logarithmic dependency on c (ignoring some complex exponents of log c in their case). Additionally, note that unlike existing multi-class learning guarantees, our results hold for arbitrary loss functions. See Corollary 12 for further details. Our sparsity-based bounds can also be used to give bounds with logarithmic dependence on the number of classes when the features only take values in {0, 1}. Finally, using Lemma 6 instead of Lemma 5, the dependency on the number of classes can be further improved.

We conclude this section by observing that, since our guarantees are expressed in terms of the average size of the factor graph over a given sample, this invites us to search for a hypothesis set H and predictor $h \in H$ such that the trade-off between the empirical size of the factor graph and empirical error is optimal. In the next section, we will make use of the recently developed principle of Voted Risk Minimization (VRM) [Cortes et al., 2015] to reach this objective.
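As a quick numeric illustration of the scaling just discussed (not from the paper), the worst-case $H_2$ complexity term for a pairwise chain with l substructures and k classes, where $|F_i| = l$ and $d_i = k^2$, can be tabulated as follows; after normalizing the Hamming loss by l it decays as $k/\sqrt m$.

```python
import math

def h2_complexity_term(m: int, l: int, k: int, Lambda2: float = 1.0, r2: float = 1.0) -> float:
    """Worst-case H_2 factor-graph complexity for a pairwise chain:
    |F_i| = l, d_i = k**2, so (Lambda2*r2/m)*sqrt(sum_{i,f,y}|F_i|) <= Lambda2*r2*l*k/sqrt(m)."""
    return Lambda2 * r2 * l * k / math.sqrt(m)

if __name__ == "__main__":
    for m in (100, 1_000, 10_000):
        # Dividing by l normalizes the Hamming loss to [0, 1] as in the Markov-network comparison.
        print(m, h2_complexity_term(m, l=20, k=5) / 20)
```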
4 Voted Risk Minimization

In many structured prediction applications such as natural language processing and computer vision, one may wish to exploit very rich features. However, the use of rich families of hypotheses could lead to overfitting. In this section, we show that it may be possible to use rich families in conjunction with simpler families, provided that fewer complex hypotheses are used (or that they are used with less mixture weight). We achieve this goal by deriving learning guarantees for ensembles of structured prediction rules that explicitly account for the differing complexities between families. This will motivate the algorithms that we present in Section 5.

Assume that we are given p families $H_1, \ldots, H_p$ of functions mapping from $\mathcal X \times \mathcal Y$ to $\mathbb R$. Define the ensemble family $F = \mathrm{conv}(\cup_{k=1}^p H_k)$, that is the family of functions f of the form $f = \sum_{t=1}^T\alpha_t h_t$, where $\alpha = (\alpha_1, \ldots, \alpha_T)$ is in the simplex and where, for each $t \in [1, T]$, $h_t$ is in $H_{k_t}$ for some $k_t \in [1, p]$. We further assume that $\mathfrak R^G_m(H_1) \le \mathfrak R^G_m(H_2) \le \cdots \le \mathfrak R^G_m(H_p)$. As an example, the $H_k$'s may be ordered by the size of the corresponding factor graphs.

The main result of this section is a generalization of the VRM theory to the structured prediction setting. The learning guarantees that we present are in terms of upper bounds on $\widehat R^{\mathrm{add}}_{S,\rho,\tau}(h)$ and $\widehat R^{\mathrm{mult}}_{S,\rho,\tau}(h)$, which are defined as follows for all $\tau \ge 0$:

$\widehat R^{\mathrm{add}}_{S,\rho,\tau}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) + \frac{\tau - h(x,y) + h(x,y')}{\rho}\Big)\Big]$   (7)

$\widehat R^{\mathrm{mult}}_{S,\rho,\tau}(h) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 + \frac{\tau - h(x,y) + h(x,y')}{\rho}\Big)\Big)\Big]$   (8)

Here, τ can be interpreted as a margin term that acts in conjunction with ρ. For simplicity, we assume in this section that $|\mathcal Y| = c < +\infty$.

Theorem 3. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, each of the following inequalities holds for all $f \in F$:

$R(f) \le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

$R(f) \le \widehat R^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt2\,M}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

where $C(\rho, M, c, m, p) = \frac{2M}{\rho}\sqrt{\frac{\log p}{m}} + 3M\sqrt{\Big\lceil\frac{4}{\rho^2}\log\Big(\frac{c^2\rho^2 m}{4\log p}\Big)\Big\rceil\frac{\log p}{m} + \frac{\log\frac2\delta}{2m}}$.

The proof of this theorem crucially depends on the theory we developed in Section 3 and is given in Appendix A. As with Theorem 1, we also present a version of this result with empirical complexities as Theorem 14 in the supplementary material. The explicit dependence of this bound on the parameter vector α suggests that learning even with highly complex hypothesis sets could be possible so long as the complexity term, which is a weighted average of the factor graph complexities, is not too large. The theorem provides a quantitative way of determining the mixture weights that should be apportioned to each family. Furthermore, the dependency on the number of distinct feature map families $H_k$ is very mild and therefore suggests that a large number of families can be used. These properties will be useful for motivating new algorithms for structured prediction.

5 Algorithms

In this section, we derive several algorithms for structured prediction based on the VRM principle discussed in Section 4. We first give general convex upper bounds (Section 5.1) on the structured prediction loss which recover as special cases the loss functions used in StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003], and Conditional Random Field (CRF) [Lafferty et al., 2001]. Next, we introduce a new algorithm, Voted Conditional Random Field (VCRF) (Section 5.2), with accompanying experiments as proof of concept. We also present another algorithm, Voted StructBoost (VStructBoost), in Appendix C.

5.1 General framework for convex surrogate losses

Given $(x, y) \in \mathcal X \times \mathcal Y$, the mapping $h \mapsto L(\mathsf h(x), y)$ is typically not a convex function of h, which leads to computationally hard optimization problems. This motivates the use of convex surrogate losses. We first introduce a general formulation of surrogate losses for structured prediction problems.

Lemma 4. For any $u \in \mathbb R_+$, let $\Phi_u\colon \mathbb R \to \mathbb R$ be an upper bound on $v \mapsto u\,1_{v \le 0}$. Then, the following upper bound holds for any $h \in H$ and $(x, y) \in \mathcal X \times \mathcal Y$:

$L(\mathsf h(x), y) \le \max_{y'\neq y}\Phi_{L(y',y)}\big(h(x,y) - h(x,y')\big).$   (9)

The proof is given in Appendix A.
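To preview how Lemma 4 is used (the specific choices of $\Phi_u$ are discussed in Section 5.1 right below), here is a minimal sketch, not from the paper, of the generic surrogate $\max_{y'\neq y}\Phi_{L(y',y)}(h(x,y) - h(x,y'))$ together with a few choices of $\Phi_u$; each $\Phi_u$ below upper bounds $v \mapsto u\,1_{v\le0}$ as the lemma requires. All names are illustrative.

```python
import math
from typing import Callable, Dict, Hashable

# Each Phi_u(v) upper bounds v -> u * 1[v <= 0], as required by Lemma 4.
def phi_structsvm(u: float, v: float) -> float:   # u * max(0, 1 - v): slack-rescaling hinge
    return u * max(0.0, 1.0 - v)

def phi_m3n(u: float, v: float) -> float:         # max(0, u - v): margin-rescaling hinge
    return max(0.0, u - v)

def phi_crf(u: float, v: float) -> float:         # log(1 + e^{u - v}): CRF-style log loss
    return math.log1p(math.exp(u - v))

def phi_structboost(u: float, v: float) -> float: # u * e^{-v}: exponential (boosting-style)
    return u * math.exp(-v)

def surrogate(scores: Dict[Hashable, float], loss: Dict[Hashable, float],
              y_true: Hashable, phi: Callable[[float, float], float]) -> float:
    """Right-hand side of Eq. (9): max_{y' != y} Phi_{L(y',y)}(h(x,y) - h(x,y'))."""
    return max(phi(loss[y], scores[y_true] - scores[y]) for y in scores if y != y_true)

if __name__ == "__main__":
    scores = {"a": 1.5, "b": 1.2, "c": -1.0}
    loss = {"a": 0.0, "b": 0.5, "c": 1.0}
    for phi in (phi_structsvm, phi_m3n, phi_crf, phi_structboost):
        print(phi.__name__, surrogate(scores, loss, "a", phi))
```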

This result defines a general framework that enables us to straightforwardly recover many of the most common state-of-the-art structured prediction algorithms via suitable choices of $\Phi_u(v)$: (a) for $\Phi_u(v) = \max(0, u(1 - v))$, the right-hand side of (9) coincides with the surrogate loss defining StructSVM [Tsochantaridis et al., 2005]; (b) for $\Phi_u(v) = \max(0, u - v)$, it coincides with the surrogate loss defining Max-Margin Markov Networks (M3N) [Taskar et al., 2003] when using for L the Hamming loss; and (c) for $\Phi_u(v) = \log(1 + e^{u-v})$, it coincides with the surrogate loss defining the Conditional Random Field (CRF) [Lafferty et al., 2001].

Moreover, alternative choices of $\Phi_u(v)$ can help define new algorithms. In particular, we will refer to the algorithm based on the surrogate loss defined by $\Phi_u(v) = u e^{-v}$ as StructBoost, in reference to the exponential loss used in AdaBoost. Another related alternative is based on the choice $\Phi_u(v) = e^{u-v}$. See Appendix C for further details on this algorithm. In fact, for each $\Phi_u(v)$ described above, the corresponding convex surrogate is an upper bound on either the multiplicative or additive margin loss introduced in Section 3. Therefore, each of these algorithms seeks a hypothesis that minimizes the generalization bounds presented in Section 3. To the best of our knowledge, this interpretation of these well-known structured prediction algorithms is also new. In what follows, we derive new structured prediction algorithms that minimize the finer generalization bounds presented in Section 4.

5.2 Voted Conditional Random Field (VCRF)

We first consider the convex surrogate loss based on $\Phi_u(v) = \log(1 + e^{u-v})$, which corresponds to the loss defining CRF models. Using the monotonicity of the logarithm and upper bounding the maximum by a sum gives the following upper bound on the surrogate loss:

$\max_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot(\Psi(x,y)-\Psi(x,y'))}\big) \le \log\Big(\sum_{y'\in\mathcal Y} e^{L(y',y) - w\cdot(\Psi(x,y)-\Psi(x,y'))}\Big),$

which, combined with the VRM principle, leads to the following optimization problem:

$\min_{w}\ \frac1m\sum_{i=1}^m\log\Big(\sum_{y\in\mathcal Y} e^{L(y,y_i) - w\cdot(\Psi(x_i,y_i)-\Psi(x_i,y))}\Big) + \sum_{k=1}^p(\lambda r_k + \beta)\|w_k\|_1,$   (10)

where $r_k = r_\infty|F_k|\sqrt{\log N}$. We refer to the learning algorithm based on the optimization problem (10) as VCRF. Note that for λ = 0, (10) coincides with the objective function of L1-regularized CRF. Observe that we can also directly use $\max_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot\delta\Psi(x,y,y')}\big)$ or its upper bound $\sum_{y'\neq y}\log\big(1 + e^{L(y',y) - w\cdot\delta\Psi(x,y,y')}\big)$ as a convex surrogate. We can similarly derive an L2-regularization formulation of the VCRF algorithm. In Appendix D, we describe efficient algorithms for solving the VCRF and VStructBoost optimization problems.

6 Experiments

In Appendix B, we corroborate our theory by reporting experimental results suggesting that the VCRF algorithm can outperform the CRF algorithm on a number of part-of-speech (POS) datasets.

7 Conclusion

We presented a general theoretical analysis of structured prediction. Our data-dependent margin guarantees for structured prediction can be used to guide the design of new algorithms or to derive guarantees for existing ones. Their explicit dependency on the properties of the factor graph and on feature sparsity can help shed new light on the role played by the graph and features in generalization. Our extension of the VRM theory to structured prediction provides a new analysis of generalization when using a very rich set of features, which is common in applications such as natural language processing, and leads to new algorithms, VCRF and VStructBoost. Our experimental results for VCRF serve as a proof of concept and motivate more extensive empirical studies of these algorithms.
Acknowledgments

This work was partly funded by NSF CCF and IIS-6866 and NSF GRFP DGE.

References

G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.

K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In ICML, 2015.

M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Proceedings of IWPT, 2001.

C. Cortes, M. Mohri, and J. Weston. A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press, 2007.

C. Cortes, V. Kuznetsov, and M. Mohri. Ensemble methods for structured prediction. In ICML, 2014.

C. Cortes, P. Goyal, V. Kuznetsov, and M. Mohri. Kernel extraction via voted risk minimization. JMLR, 2015.

H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297-325, 2009.

J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1):1317-1350, 2014.

D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In NIPS, 2014.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. HC-search for structured prediction in computer vision. In CVPR, 2015.

Y. Lei, Ü. D. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In NIPS, 2015.

A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, 2013.

A. Maurer. A vector-contraction inequality for Rademacher complexities. In ALT, 2016.

D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26, January 2007.

S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, Dec. 2005.

O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015a.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015b.

D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In IJCNLP, 2008.

A Proofs

This appendix section gathers detailed proofs of all of our main results. In Appendix A.1, we prove a contraction lemma used as a tool in the proof of our general factor graph Rademacher complexity bounds (Appendix A.3). In Appendix A.8, we further extend our bounds to the Voted Risk Minimization setting. Appendix A.5 gives explicit upper bounds on the factor graph Rademacher complexity of several commonly used hypothesis sets. In Appendix A.9, we prove a general upper bound on a loss function used in structured prediction in terms of a convex surrogate.

A.1 Contraction lemma

The following contraction lemma will be a key tool used in the proofs of our generalization bounds for structured prediction.

Lemma 5. Let H be a hypothesis set of functions mapping $\mathcal X$ to $\mathbb R^c$. Assume that for all $i = 1, \ldots, m$, $\Psi_i\colon \mathbb R^c \to \mathbb R$ is $\mu_i$-Lipschitz for $\mathbb R^c$ equipped with the 2-norm. That is:

$|\Psi_i(x') - \Psi_i(x)| \le \mu_i\|x' - x\|_2, \quad\text{for all } (x', x) \in (\mathbb R^c)^2.$   (11)

Then, for any sample S of m points $x_1, \ldots, x_m \in \mathcal X$, the following inequality holds:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] \le \sqrt2\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{j=1}^c\epsilon_{ij}\,\mu_i h_j(x_i)\Big],$   (12)

where $\epsilon = (\epsilon_{ij})_{i,j}$ and the $\epsilon_{ij}$'s are independent Rademacher variables uniformly distributed over $\{\pm1\}$.

Proof. Fix a sample $S = (x_1, \ldots, x_m)$. Then, we can rewrite the left-hand side of (12) as follows:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] = \mathbb E_{\sigma_1,\ldots,\sigma_{m-1}}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big]\Big],$

where $U_{m-1}(h) = \sum_{i=1}^{m-1}\sigma_i\Psi_i(h(x_i))$. Assume that the suprema can be attained and let $h_1, h_2 \in H$ be the hypotheses satisfying

$U_{m-1}(h_1) + \Psi_m(h_1(x_m)) = \sup_{h\in H} U_{m-1}(h) + \Psi_m(h(x_m)),$
$U_{m-1}(h_2) - \Psi_m(h_2(x_m)) = \sup_{h\in H} U_{m-1}(h) - \Psi_m(h(x_m)).$

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are ε-close to the suprema for any ε > 0. By definition of expectation, since $\sigma_m$ is uniformly distributed over $\{\pm1\}$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] = \tfrac12\sup_{h\in H}\big[U_{m-1}(h) + \Psi_m(h(x_m))\big] + \tfrac12\sup_{h\in H}\big[U_{m-1}(h) - \Psi_m(h(x_m))\big]$
$= \tfrac12\big[U_{m-1}(h_1) + \Psi_m(h_1(x_m))\big] + \tfrac12\big[U_{m-1}(h_2) - \Psi_m(h_2(x_m))\big].$

Next, using the $\mu_m$-Lipschitzness of $\Psi_m$ and the Khintchine-Kahane inequality, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \tfrac12\big[U_{m-1}(h_1) + U_{m-1}(h_2) + \mu_m\|h_1(x_m) - h_2(x_m)\|_2\big]$
$\le \tfrac12\Big[U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m\,\mathbb E_{\epsilon_{m1},\ldots,\epsilon_{mc}}\Big|\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m) - h_{2,j}(x_m)\big)\Big|\Big].$

Now, let $\epsilon_m$ denote $(\epsilon_{m1}, \ldots, \epsilon_{mc})$ and let $s(\epsilon_m) \in \{\pm1\}$ denote the sign of $\sum_{j=1}^c\epsilon_{mj}(h_{1,j}(x_m) - h_{2,j}(x_m))$. Then, the following holds:

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m) - h_{2,j}(x_m)\big)s(\epsilon_m)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m)\Big)\Big]$
$\le \mathbb E_{\epsilon_m}\Big[\tfrac12\sup_{h\in H}\Big(U_{m-1}(h) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big) + \tfrac12\sup_{h\in H}\Big(U_{m-1}(h) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sigma_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big]\Big]$
$= \mathbb E_{\epsilon_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}h_j(x_m)\Big].$

Proceeding in the same way for all other $\sigma_i$'s ($i < m$) completes the proof.

A.2 Contraction lemma for the $\|\cdot\|_{\infty,2}$-norm

In this section, we present an extension of the contraction Lemma 5 that can be used to remove the dependency on the alphabet size in all of our bounds.

Lemma 6. Let H be a hypothesis set of functions mapping $\mathcal X \times [d]$ to $\mathbb R^c$. Assume that for all $i = 1, \ldots, m$, $\Psi_i$ is $\mu_i$-Lipschitz for $\mathbb R^{c\times d}$ equipped with the norm $\|\cdot\|_{\infty,2}$ for some $\mu_i > 0$. That is,

$|\Psi_i(x') - \Psi_i(x)| \le \mu_i\|x' - x\|_{\infty,2}, \quad\text{for all } (x', x) \in (\mathbb R^{c\times d})^2.$   (13)

Then, for any sample S of m points $x_1, \ldots, x_m \in \mathcal X$, there exists a distribution U over $[d]^c$ such that the following inequality holds:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] \le \sqrt2\,\mathbb E_{\upsilon\sim U,\,\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{j=1}^c\epsilon_{ij}\,\mu_i h_j(x_i, \upsilon_j)\Big],$   (14)

where $\epsilon = (\epsilon_{ij})_{i,j}$ and the $\epsilon_{ij}$'s are independent Rademacher variables uniformly distributed over $\{\pm1\}$ and $\upsilon = (\upsilon_{i,j})_{i,j}$ is a sequence of random variables distributed according to U. Note that the $\upsilon_{i,j}$'s themselves do not need to be independent.

Proof. Fix a sample $S = (x_1, \ldots, x_m)$. Then, we can rewrite the left-hand side of (14) as follows:

$\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\Psi_i(h(x_i))\Big] = \mathbb E_{\sigma_1,\ldots,\sigma_{m-1}}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big]\Big],$

where $U_{m-1}(h) = \sum_{i=1}^{m-1}\sigma_i\Psi_i(h(x_i))$. Assume that the suprema can be attained and let $h_1, h_2 \in H$ be the hypotheses satisfying

$U_{m-1}(h_1) + \Psi_m(h_1(x_m)) = \sup_{h\in H} U_{m-1}(h) + \Psi_m(h(x_m)),$
$U_{m-1}(h_2) - \Psi_m(h_2(x_m)) = \sup_{h\in H} U_{m-1}(h) - \Psi_m(h(x_m)).$

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are ε-close to the suprema for any ε > 0. By definition of expectation, since $\sigma_m$ is uniformly distributed over $\{\pm1\}$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] = \tfrac12\sup_{h\in H}\big[U_{m-1}(h) + \Psi_m(h(x_m))\big] + \tfrac12\sup_{h\in H}\big[U_{m-1}(h) - \Psi_m(h(x_m))\big]$
$= \tfrac12\big[U_{m-1}(h_1) + \Psi_m(h_1(x_m))\big] + \tfrac12\big[U_{m-1}(h_2) - \Psi_m(h_2(x_m))\big].$

Next, using the $\mu_m$-Lipschitzness of $\Psi_m$, we can write

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \tfrac12\big[U_{m-1}(h_1) + U_{m-1}(h_2) + \mu_m\|h_1(x_m) - h_2(x_m)\|_{\infty,2}\big].$

Define the random variables $\upsilon_j = \upsilon_j(\sigma) = \operatorname{argmax}_{k\in[d]}|h_{1,j}(x_m, k) - h_{2,j}(x_m, k)|$, so that $\|h_1(x_m) - h_2(x_m)\|_{\infty,2} = \big(\sum_{j=1}^c(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j))^2\big)^{1/2}$. By the Khintchine-Kahane inequality,

$\mu_m\|h_1(x_m) - h_2(x_m)\|_{\infty,2} \le \sqrt2\,\mu_m\,\mathbb E_{\epsilon_{m1},\ldots,\epsilon_{mc}}\Big|\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j)\big)\Big|.$

Now, let $\epsilon_m$ denote $(\epsilon_{m1},\ldots,\epsilon_{mc})$ and let $s(\epsilon_m) \in \{\pm1\}$ denote the sign of $\sum_{j=1}^c\epsilon_{mj}(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j))$. Then, the following holds:

$\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sigma_m\Psi_m(h(x_m))\Big] \le \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + U_{m-1}(h_2) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}\big(h_{1,j}(x_m,\upsilon_j) - h_{2,j}(x_m,\upsilon_j)\big)\Big)\Big]$
$= \mathbb E_{\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m,\upsilon_j)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m,\upsilon_j)\Big)\Big].$

After taking expectation over υ, the rest of the proof proceeds the same way as the argument in Lemma 5:

$\mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\tfrac12\Big(U_{m-1}(h_1) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{1,j}(x_m,\upsilon_j)\Big) + \tfrac12\Big(U_{m-1}(h_2) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_{2,j}(x_m,\upsilon_j)\Big)\Big]$
$\le \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\tfrac12\sup_{h\in H}\Big(U_{m-1}(h) + \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big) + \tfrac12\sup_{h\in H}\Big(U_{m-1}(h) - \sqrt2\,\mu_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big)\Big]$
$= \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\mathbb E_{\sigma_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sigma_m s(\epsilon_m)\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big]\Big]$
$= \mathbb E_{\upsilon\sim U,\,\epsilon_m}\Big[\sup_{h\in H} U_{m-1}(h) + \sqrt2\,\mu_m\sum_{j=1}^c\epsilon_{mj}h_j(x_m,\upsilon_j)\Big].$

Proceeding in the same way for all other $\sigma_i$'s ($i < m$) completes the proof.

A.3 General structured prediction learning bounds

In this section, we give the proof of several general structured prediction bounds in terms of the notion of factor graph Rademacher complexity. We will use the additive and multiplicative margin losses of a hypothesis h, which are the population versions of the empirical margin losses we introduced in (5) and (6) and are defined as follows:

$R^{\mathrm{add}}_\rho(h) = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big],$
$R^{\mathrm{mult}}_\rho(h) = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y)\Big(1 - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big)\Big].$

The following is our general margin bound for structured prediction.

Theorem 1. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}},$
$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\mathfrak R^G_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

Proof. Let $\Phi_u(v) = \Phi^*(u - \frac v\rho)$, where $\Phi^*(r) = \min(M, \max(0, r))$. Observe that for any $u \in [0, M]$, $u\,1_{v\le0} \le \Phi_u(v)$ for all v. Therefore, by Lemma 4 and monotonicity of $\Phi^*$,

$R(h) \le \mathbb E_{(x,y)\sim D}\Big[\max_{y'\neq y}\Phi^*\Big(L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big] = \mathbb E_{(x,y)\sim D}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac1\rho\big(h(x,y) - h(x,y')\big)\Big)\Big] = R^{\mathrm{add}}_\rho(h).$

Define

$\mathcal H_0 = \Big\{(x,y)\mapsto\Phi^*\Big(\max_{y'\neq y} L(y', y) - \tfrac{h(x,y)-h(x,y')}{\rho}\Big)\colon h\in H\Big\},\qquad \mathcal H_1 = \Big\{(x,y)\mapsto\max_{y'\neq y}\Big(L(y', y) - \tfrac{h(x,y)-h(x,y')}{\rho}\Big)\colon h\in H\Big\}.$

By standard Rademacher complexity bounds [Koltchinskii and Panchenko, 2002], for any δ > 0, with probability at least 1 − δ, the following inequality holds for all $h \in H$:

$R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + 2\mathfrak R_m(\mathcal H_0) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

where $\mathfrak R_m(\mathcal H_0)$ is the Rademacher complexity of the family $\mathcal H_0$:

$\mathfrak R_m(\mathcal H_0) = \frac1m\,\mathbb E_{S\sim D^m}\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\,\Phi^*\Big(\max_{y'\neq y_i} L(y', y_i) - \tfrac{h(x_i,y_i)-h(x_i,y')}{\rho}\Big)\Big],$

and where $\sigma = (\sigma_1, \ldots, \sigma_m)$ with the $\sigma_i$'s independent Rademacher random variables uniformly distributed over $\{\pm1\}$. Since $\Phi^*$ is 1-Lipschitz, by Talagrand's contraction lemma [Ledoux and Talagrand, 1991, Mohri et al., 2012], we have $\widehat{\mathfrak R}_S(\mathcal H_0) \le \widehat{\mathfrak R}_S(\mathcal H_1)$. By taking an expectation over S, this inequality carries over to the true Rademacher complexities as well. Now, observe that by the sub-additivity of the supremum, the following holds:

$\widehat{\mathfrak R}_S(\mathcal H_1) \le \frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big] + \frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\frac{h(x_i, y_i)}{\rho}\Big],$

where we also used for the last term the fact that $-\sigma_i$ and $\sigma_i$ admit the same distribution. We use Lemma 5 to bound each of the two terms appearing on the right-hand side separately. To do so, we first show the Lipschitzness of $h \mapsto \max_{y'\neq y_i}\big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\big)$. Observe that the following chain of inequalities holds for any $h', h \in H$:

$\Big|\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h'(x_i, y')}{\rho}\Big) - \max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big| \le \frac1\rho\max_{y'\neq y_i}\big|h'(x_i, y') - h(x_i, y')\big| \le \frac1\rho\max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big|$
$= \frac1\rho\max_{y\in\mathcal Y}\Big|\sum_{f\in F_i}\big(h'_f(x_i, y_f) - h_f(x_i, y_f)\big)\Big| \le \frac1\rho\max_{y\in\mathcal Y}\sum_{f\in F_i}\big|h'_f(x_i, y_f) - h_f(x_i, y_f)\big| \le \frac1\rho\sum_{f\in F_i}\max_{y\in\mathcal Y_f}\big|h'_f(x_i, y) - h_f(x_i, y)\big|$
$\le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\max_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2} \le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2}.$

We can therefore apply Lemma 5, which yields

$\frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\max_{y'\neq y_i}\Big(L(y', y_i) + \frac{h(x_i, y')}{\rho}\Big)\Big] \le \frac{\sqrt2}{\rho m}\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}h_f(x_i, y)\Big] = \frac{\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H).$

Similarly, for the second term, observe that the following Lipschitz property holds:

$\frac1\rho\big|h'(x_i, y_i) - h(x_i, y_i)\big| \le \frac1\rho\max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big| \le \frac{\sqrt{|F_i|}}{\rho}\sqrt{\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\big(h'_f(x_i, y) - h_f(x_i, y)\big)^2}.$

We can therefore apply Lemma 5 and obtain the following:

$\frac1m\,\mathbb E_{\sigma}\Big[\sup_{h\in H}\sum_{i=1}^m\sigma_i\frac{h(x_i, y_i)}{\rho}\Big] \le \frac{\sqrt2}{\rho m}\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}h_f(x_i, y)\Big] = \frac{\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H).$

Taking the expectation over S of the two inequalities shows that $\mathfrak R_m(\mathcal H_1) \le \frac{2\sqrt2}{\rho}\,\mathfrak R^G_m(H)$, which completes the proof of the first statement. The second statement can be proven in a similar way with $\Phi_u(v) = \Phi^*(u(1 - \frac v\rho))$. In particular, by standard Rademacher complexity bounds, McDiarmid's inequality, and Talagrand's contraction lemma, we can write

$R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + 2\mathfrak R_m(\mathcal H_2) + M\sqrt{\frac{\log\frac1\delta}{2m}},$

where $\mathcal H_2 = \big\{(x,y)\mapsto\max_{y'\neq y} L(y', y)\big(1 - \frac{h(x,y)-h(x,y')}{\rho}\big)\colon h\in H\big\}$. We observe that the following inequality holds:

$\Big|\max_{y'\neq y_i} L(y', y_i)\Big(1 - \frac{h'(x_i,y_i)-h'(x_i,y')}{\rho}\Big) - \max_{y'\neq y_i} L(y', y_i)\Big(1 - \frac{h(x_i,y_i)-h(x_i,y')}{\rho}\Big)\Big| \le \frac M\rho\Big(\big|h'(x_i, y_i) - h(x_i, y_i)\big| + \max_{y\in\mathcal Y}\big|h'(x_i, y) - h(x_i, y)\big|\Big).$

Then, the rest of the proof follows from Lemma 5 as in the previous argument.

In the proof above, we could have applied McDiarmid's inequality to bound the Rademacher complexity of $\mathcal H_0$ by its empirical counterpart at the cost of slightly increasing the exponential concentration term:

$R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + 2\widehat{\mathfrak R}_S(\mathcal H_0) + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Since Talagrand's contraction lemma holds for empirical Rademacher complexities and the remainder of the proof involves bounding the empirical Rademacher complexity of $\mathcal H_1$ before taking an expectation over the sample at the end, we can apply the same arguments without the final expectation to arrive at the following analogue of Theorem 1 in terms of empirical complexities:

Theorem 7. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H$:

$R(h) \le R^{\mathrm{add}}_\rho(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2}{\rho}\,\widehat{\mathfrak R}^G_S(H) + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le R^{\mathrm{mult}}_\rho(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M}{\rho}\,\widehat{\mathfrak R}^G_S(H) + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

This theorem will be useful for many of our applications, which are based on bounding the empirical factor graph Rademacher complexity for different hypothesis classes.

A.4 Concentration of the empirical factor graph Rademacher complexity

In this section, we show that, as with the standard notion of Rademacher complexity, the empirical factor graph Rademacher complexity also concentrates around its mean.

Lemma 8. Let H be a family of scoring functions mapping $\mathcal X \times \mathcal Y \to \mathbb R$ bounded by a constant C. Let S be a training sample of size m drawn i.i.d. according to some distribution D on $\mathcal X \times \mathcal Y$, and let $D_X$ be the marginal distribution on $\mathcal X$. For any point $x \in \mathcal X$, let $F_x$ denote its associated set of factor nodes. Then, with probability at least 1 − δ over the draw of the sample $S \sim D^m$,

$\big|\widehat{\mathfrak R}^G_S(H) - \mathfrak R^G_m(H)\big| \le C\Big(\sup_{x\in\mathrm{supp}(D_X)}\sum_{f\in F_x}|\mathcal Y_f|\sqrt{|F_x|}\Big)\sqrt{\frac{2\log\frac2\delta}{m}}.$

Proof. Let $S = (x_1, x_2, \ldots, x_m)$ and $S' = (x'_1, x'_2, \ldots, x'_m)$ be two samples differing by one point, $x_j$ and $x'_j$ (i.e. $x_i = x'_i$ for $i \neq j$). Then

$\widehat{\mathfrak R}^G_S(H) - \widehat{\mathfrak R}^G_{S'}(H) \le \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{h\in H}\Big(\sum_{f\in F_{x_j}}\sum_{y\in\mathcal Y_f}\sqrt{|F_{x_j}|}\,\epsilon_{j,f,y}h_f(x_j, y) - \sum_{f\in F_{x'_j}}\sum_{y\in\mathcal Y_f}\sqrt{|F_{x'_j}|}\,\epsilon_{j,f,y}h_f(x'_j, y)\Big)\Big]$
$\le \frac{2C}m\sup_{x\in\mathrm{supp}(D_X)}\sum_{f\in F_x}\sum_{y\in\mathcal Y_f}\sqrt{|F_x|}.$

The same upper bound also holds for $\widehat{\mathfrak R}^G_{S'}(H) - \widehat{\mathfrak R}^G_S(H)$. The result now follows from McDiarmid's inequality.

A.5 Bounds on the factor graph Rademacher complexity

The following lemma is a standard bound on the expectation of the maximum of n zero-mean bounded random variables, which will be used in the proof of our bounds on the factor graph Rademacher complexity.

Lemma 9. Let $X_1, \ldots, X_n$ be n real-valued random variables such that for all $j \in [1, n]$, $X_j = \sum_{i=1}^{m_j} Y_{ij}$ where, for each fixed $j \in [1, n]$, the $Y_{ij}$ are independent zero-mean random variables with $|Y_{ij}| \le t_{ij}$. Then, the following inequality holds:

$\mathbb E\Big[\max_{j\in[1,n]} X_j\Big] \le t\sqrt{2\log n},$

with $t = \max_{j\in[1,n]}\sqrt{\sum_{i=1}^{m_j} t_{ij}^2}$.

The following are upper bounds on the factor graph Rademacher complexity for $H_1$ and $H_2$, as defined in Section 3. Similar guarantees can be given for other hypothesis sets $H_p$ with p > 2.

Theorem 2. For any sample $S = (x_1, \ldots, x_m)$, the following upper bounds hold for the empirical factor graph complexity of $H_1$ and $H_2$:

$\widehat{\mathfrak R}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m}\sqrt{s\,\log(2N)},\qquad \widehat{\mathfrak R}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

where $r_\infty = \max_{i,f,y}\|\Psi_f(x_i, y)\|_\infty$, $r_2 = \max_{i,f,y}\|\Psi_f(x_i, y)\|_2$ and where s is a sparsity factor defined by $s = \max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}$.

Proof. By definition of the dual norm and Lemma 9 (or Massart's lemma), the following holds:

$\widehat{\mathfrak R}^G_S(H_1) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_1\le\Lambda_1} w\cdot\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big] = \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_\infty\Big]$
$= \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\max_{j\in[1,N],\,\sigma\in\{-1,+1\}}\sigma\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_{f,j}(x_i, y)\Big]$
$= \frac{\Lambda_1}m\,\mathbb E_{\epsilon}\Big[\max_{j\in[1,N],\,\sigma\in\{-1,+1\}}\sigma\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_{f,j}(x_i, y)\,1_{\Psi_{f,j}(x_i,y)\neq0}\Big]$
$\le \frac{\Lambda_1}m\sqrt{\Big(\max_{j\in[1,N]}\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,1_{\Psi_{f,j}(x_i,y)\neq0}\Big)\,r_\infty^2\,\log(2N)} = \frac{\Lambda_1 r_\infty}m\sqrt{s\,\log(2N)},$

which completes the proof of the first statement. The second statement can be proven in a similar way using the definition of the dual norm and Jensen's inequality:

$\widehat{\mathfrak R}^G_S(H_2) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_2\le\Lambda_2} w\cdot\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big] = \frac{\Lambda_2}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_2\Big]$
$\le \frac{\Lambda_2}m\Big(\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}\sqrt{|F_i|}\,\epsilon_{i,f,y}\Psi_f(x_i, y)\Big\|_2^2\Big]\Big)^{1/2} = \frac{\Lambda_2}m\Big(\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|\,\|\Psi_f(x_i, y)\|_2^2\Big)^{1/2} \le \frac{\Lambda_2 r_2}m\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|},$

which concludes the proof.

A.6 Learning guarantees for structured prediction with linear hypotheses

The following result is a direct consequence of Theorem 7 and Theorem 2.

Corollary 10. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_1$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho m}\sqrt{s\,\log(2N)} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M\Lambda_1 r_\infty}{\rho m}\sqrt{s\,\log(2N)} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Similarly, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,M\Lambda_2 r_2}{\rho m}\sqrt{\sum_{i=1}^m\sum_{f\in F_i}\sum_{y\in\mathcal Y_f}|F_i|} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

A.7 Learning guarantees for multi-class classification with linear hypotheses

The following result is a direct consequence of Corollary 10 and the observation that for multi-class classification $|F_i| = 1$ and $d_i = \max_{f\in F_i}|\mathcal Y_f| = c$. Note that our multi-class learning guarantees hold for arbitrary bounded losses. To the best of our knowledge this is a novel result in this setting. In particular, these guarantees apply to the special case of the standard multi-class zero-one loss $L(y, y') = 1_{y'\neq y}$, which is bounded by M = 1.

Corollary 11. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_1$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho}\sqrt{\frac{c\log(2N)}{m}} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_1 r_\infty}{\rho}\sqrt{\frac{c\log(2N)}{m}} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Similarly, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_2$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho}\sqrt{\frac cm} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt2\,\Lambda_2 r_2}{\rho}\sqrt{\frac cm} + 3M\sqrt{\frac{\log\frac2\delta}{2m}}.$

Consider the following set of linear hypotheses:

$H_{2,1} = \{x\mapsto w\cdot\Psi(x, y)\colon \|w\|_{2,1}\le\Lambda_{2,1},\ y\in[c]\},$

where $\Psi(x, y) = (0, \ldots, 0, \Psi_y(x), 0, \ldots, 0)^\top \in \mathbb R^{N_1+\cdots+N_c}$ and $w = (w_1, \ldots, w_c)$ with $\|w\|_{2,1} = \sum_{y=1}^c\|w_y\|_2$. In this case, $w\cdot\Psi(x, y) = w_y\cdot\Psi_y(x)$. The standard scenario in multi-class classification is when $\Psi_y(x) = \Psi(x)$ is the same for all y.

Corollary 12. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all $h \in H_{2,1}$:

$R(h) \le \widehat R^{\mathrm{add}}_{S,\rho}(h) + \frac{16\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$
$R(h) \le \widehat R^{\mathrm{mult}}_{S,\rho}(h) + \frac{16\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\rho\sqrt m} + 3M\sqrt{\frac{\log\frac2\delta}{2m}},$

where $r_{2,\infty} = \max_{i,y}\|\Psi_y(x_i)\|_2$.

Proof. By definition of the dual norm and $H_{2,1}$, the following holds:

$\widehat{\mathfrak R}^G_S(H_{2,1}) = \frac1m\,\mathbb E_{\epsilon}\Big[\sup_{\|w\|_{2,1}\le\Lambda_{2,1}} w\cdot\sum_{i=1}^m\sum_{y=1}^c\epsilon_{i,y}\Psi(x_i, y)\Big] = \frac{\Lambda_{2,1}}m\,\mathbb E_{\epsilon}\Big[\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi(x_i, y)\Big\|_{2,\infty}\Big] = \frac{\Lambda_{2,1}}m\,\mathbb E_{\epsilon}\Big[\max_{y}\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi_y(x_i)\Big\|_2\Big]$
$\le \frac{\Lambda_{2,1}}m\Big(\mathbb E_{\epsilon}\Big[\max_{y}\Big\|\sum_{i=1}^m\epsilon_{i,y}\Psi_y(x_i)\Big\|_2^2\Big]\Big)^{1/2} = \frac{\Lambda_{2,1}}m\Big(\mathbb E_{\epsilon}\Big[\max_{y}\Big(\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 + \sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big)\Big]\Big)^{1/2}$
$\le \frac{\Lambda_{2,1}}m\Big(\max_{y}\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 + \mathbb E_{\epsilon}\Big[\max_{y}\sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big]\Big)^{1/2}.$

By Lemma 9 (or Massart's lemma), the following bound holds:

$\mathbb E_{\epsilon}\Big[\max_{y}\sum_{i\neq j}\epsilon_{i,y}\epsilon_{j,y}\Psi_y(x_i)\cdot\Psi_y(x_j)\Big] \le m\,r_{2,\infty}^2\sqrt{2\log c}.$

Since $\max_y\sum_{i=1}^m\|\Psi_y(x_i)\|_2^2 \le m\,r_{2,\infty}^2$, we obtain that the following result holds:

$\widehat{\mathfrak R}^G_S(H_{2,1}) \le \frac{\Lambda_{2,1} r_{2,\infty}\big(1+\sqrt{2\log c}\big)^{1/2}}{\sqrt m} \le \frac{2\sqrt2\,\Lambda_{2,1} r_{2,\infty}(\log c)^{1/4}}{\sqrt m},$

and applying Theorem 1 completes the proof.

A.8 VRM structured prediction learning bounds

Here, we give the proof of our structured prediction learning guarantees in the setting of Voted Risk Minimization. We will use the following lemma.

Lemma 13. The function $\Phi^*$ is sub-additive: $\Phi^*(x + y) \le \Phi^*(x) + \Phi^*(y)$, for all $x, y \in \mathbb R$.

Proof. By the sub-additivity of the maximum function, for any $x, y \in \mathbb R$, the following upper bound holds for $\Phi^*(x+y)$:

$\Phi^*(x+y) = \min(M, \max(0, x+y)) \le \min(M, \max(0, x) + \max(0, y)) \le \min(M, \max(0, x)) + \min(M, \max(0, y)) = \Phi^*(x) + \Phi^*(y),$

which completes the proof.

For the following proof, for any $\tau \ge 0$, the margin losses $R^{\mathrm{add}}_{\rho,\tau}(h)$ and $R^{\mathrm{mult}}_{\rho,\tau}(h)$ are defined as the population counterparts of the empirical losses defined by (7) and (8).

Theorem 3. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, each of the following inequalities holds for all $f \in F$:

$R(f) \le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$
$R(f) \le \widehat R^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt2\,M}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + C(\rho, M, c, m, p),$

where $C(\rho, M, c, m, p) = \frac{2M}{\rho}\sqrt{\frac{\log p}{m}} + 3M\sqrt{\Big\lceil\frac{4}{\rho^2}\log\Big(\frac{c^2\rho^2 m}{4\log p}\Big)\Big\rceil\frac{\log p}{m} + \frac{\log\frac2\delta}{2m}}$.

Proof. The proof makes use of Theorem 1 and the proof techniques of Kuznetsov et al. [2014] (Theorem 1), but requires a finer analysis, both because of the general loss functions used here and because of the more complex structure of the hypothesis set.

For a fixed $h = (h_1, \ldots, h_T)$, any α in the probability simplex defines a distribution over $\{h_1, \ldots, h_T\}$. Sampling from $\{h_1, \ldots, h_T\}$ according to α and averaging leads to functions g of the form $g = \frac1n\sum_{t=1}^T n_t h_t$ for some $n = (n_1, \ldots, n_T) \in \mathbb N^T$, with $\sum_{t=1}^T n_t = n$, and $h_t \in H_{k_t}$. For any $N = (N_1, \ldots, N_p)$ with $|N| = n$, we consider the family of functions

$G_{F,N} = \Big\{\frac1n\sum_{k=1}^p\sum_{j=1}^{N_k} h_{k,j}\colon \forall(k, j)\in[p]\times[N_k],\ h_{k,j}\in H_k\Big\},$

and the union of all such families $G_{F,n} = \cup_{|N|=n} G_{F,N}$. Fix ρ > 0. For a fixed N, the empirical factor graph Rademacher complexity of $G_{F,N}$ can be bounded as follows for any $m \ge 1$:

$\widehat{\mathfrak R}^G_S(G_{F,N}) \le \frac1n\sum_{k=1}^p N_k\,\widehat{\mathfrak R}^G_S(H_k),$

which also implies the result for the true factor graph Rademacher complexities. Thus, by Theorem 1, the following learning bound holds: for any δ > 0, with probability at least 1 − δ, for all $g \in G_{F,N}$,

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p N_k\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$

Since there are at most $p^n$ possible p-tuples N with $|N| = n$, by the union bound, for any δ > 0, with probability at least 1 − δ, for all $g \in G_{F,n}$, we can write

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p N_k\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Thus, with probability at least 1 − δ, for all functions $g = \frac1n\sum_{t=1}^T n_t h_t$ with $h_t \in H_{k_t}$, the following inequality holds:

$R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \frac{4\sqrt2}{\rho}\cdot\frac1n\sum_{k=1}^p\sum_{t\colon k_t=k} n_t\,\mathfrak R^G_m(H_k) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Taking the expectation with respect to α and using $\mathbb E_\alpha[n_t/n] = \alpha_t$, we obtain that for any δ > 0, with probability at least 1 − δ, for all g, we can write

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{p^n}\delta}{2m}}.$

Fix $n \ge 1$. Then, for any $\delta_n > 0$, with probability at least $1 - \delta_n$,

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{p^n}{\delta_n}}{2m}}.$

The number S(p, n) of p-tuples N with $|N| = n$ is known to be precisely $\binom{p+n-1}{p-1}$.

Choose $\delta_n = \frac{\delta}{2p^{n-1}}$ for some δ > 0; then, for $p \ge 2$, $\sum_{n\ge1}\delta_n = \frac{\delta}{2(1 - 1/p)} \le \delta$. Thus, for any δ > 0 and any $n \ge 1$, with probability at least 1 − δ, the following holds for all g:

$\mathbb E_\alpha\big[R^{\mathrm{add}}_{\rho,1/2}(g) - \widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)\big] \le \frac{4\sqrt2}{\rho}\sum_{t=1}^T\alpha_t\,\mathfrak R^G_m(H_{k_t}) + M\sqrt{\frac{\log\frac{2p^n}\delta}{2m}}.$   (15)

Now, for any $f = \sum_{t=1}^T\alpha_t h_t \in F$ and any $g = \frac1n\sum_{t=1}^T n_t h_t$, using (4), we can upper bound R(f), the generalization error of f, as follows:

$R(f) = \mathbb E_{(x,y)\sim D}\big[L(\mathsf f(x), y)\,1_{\rho_f(x,y)\le0}\big]$
$\le \mathbb E\big[L(\mathsf f(x), y)\,1_{\rho_f(x,y) - (g(x,y) - g(x,y_f)) < -1/2}\big] + \mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big]$
$\le M\,\Pr\big[\rho_f(x,y) - (g(x,y) - g(x,y_f)) < -1/2\big] + \mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big],$

where for any function $\varphi\colon \mathcal X\times\mathcal Y\to\mathbb R$, we define $y_\varphi$ as follows: $y_\varphi = \operatorname{argmax}_{y'\neq y}\varphi(x, y')$. Using the same arguments as in the proof of Lemma 4, one can show that

$\mathbb E\big[L(\mathsf f(x), y)\,1_{g(x,y) - g(x,y_f) \le 1/2}\big] \le R^{\mathrm{add}}_{\rho,1/2}(g).$

We now give a lower bound on $\widehat R^{\mathrm{add}}_{S,\rho,1}(f)$ in terms of $\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)$. To do so, we start with the expression of $\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g)$:

$\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) = \mathbb E_{(x,y)\sim S}\Big[\Phi^*\Big(\max_{y'\neq y} L(y', y) + \frac{\tfrac12 + g(x,y') - g(x,y)}{\rho}\Big)\Big].$

By the sub-additivity of max, we can write

$\max_{y'\neq y}\Big(L(y', y) + \frac{\tfrac12 + g(x,y') - g(x,y)}{\rho}\Big) \le \max_{y'\neq y}\Big(L(y', y) + \frac{1 + f(x,y') - f(x,y)}{\rho}\Big) + \max_{y'\neq y}\Big(\frac{f(x,y) - f(x,y') - \tfrac12 + g(x,y') - g(x,y)}{\rho}\Big) = X + Y,$

where X and Y are defined by

$X = \max_{y'\neq y}\Big(L(y', y) + \frac{1 + f(x,y') - f(x,y)}{\rho}\Big),\qquad Y = \max_{y'\neq y}\Big(\frac{f(x,y) - f(x,y') - \tfrac12 + g(x,y') - g(x,y)}{\rho}\Big).$

In view of that, since $\Phi^*$ is non-decreasing and sub-additive (Lemma 13), we can write

$\widehat R^{\mathrm{add}}_{S,\rho,1/2}(g) \le \mathbb E_{(x,y)\sim S}\big[\Phi^*(X + Y)\big] \le \mathbb E_{(x,y)\sim S}\big[\Phi^*(X)\big] + \mathbb E_{(x,y)\sim S}\big[\Phi^*(Y)\big] = \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + \mathbb E_{(x,y)\sim S}\big[\Phi^*(Y)\big]$
$\le \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + M\,\mathbb E_{(x,y)\sim S}\big[1_{Y>0}\big] = \widehat R^{\mathrm{add}}_{S,\rho,1}(f) + M\,\Pr_{(x,y)\sim S}\Big[\max_{y'\neq y}\big\{(f(x,y) - g(x,y)) + (g(x,y') - f(x,y'))\big\} > \tfrac12\Big].$


Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Transductive Learning of Structural SVMs via Prior Knowledge Constraints

Transductive Learning of Structural SVMs via Prior Knowledge Constraints Transductive Learning of Structural SVMs via Prior Knowledge Constraints Chun-Na Yu Dept. of Coputing Science & AICML University of Alberta, AB, Canada chunna@cs.ualberta.ca Abstract Reducing the nuber

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Learning with Deep Cascades

Learning with Deep Cascades Learning with Deep Cascades Giulia DeSalvo 1, Mehryar Mohri 1,2, and Uar Syed 2 1 Courant Institute of Matheatical Sciences, 251 Mercer Street, New Yor, NY 10012 2 Google Research, 111 8th Avenue, New

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Deep Boosting. Abstract. 1. Introduction

Deep Boosting. Abstract. 1. Introduction Corinna Cortes Google Research, 8th Avenue, New York, NY Mehryar Mohri Courant Institute and Google Research, 25 Mercer Street, New York, NY 2 Uar Syed Google Research, 8th Avenue, New York, NY Abstract

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

arxiv: v4 [cs.lg] 4 Apr 2016

arxiv: v4 [cs.lg] 4 Apr 2016 e-publication 3 3-5 Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions arxiv:35796v4 cslg 4 Apr 6 Corinna Cortes Google Research, 76 Ninth Avenue, New York, NY Spencer

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

E. Alpaydın AERFAISS

E. Alpaydın AERFAISS E. Alpaydın AERFAISS 00 Introduction Questions: Is the error rate of y classifier less than %? Is k-nn ore accurate than MLP? Does having PCA before iprove accuracy? Which kernel leads to highest accuracy

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Deep Boosting. Abstract. 1. Introduction

Deep Boosting. Abstract. 1. Introduction Corinna Cortes Google Research, 8th Avenue, New York, NY 00 Mehryar Mohri Courant Institute and Google Research, 5 Mercer Street, New York, NY 00 Uar Syed Google Research, 8th Avenue, New York, NY 00 Abstract

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Generalization Bounds for Learning Weighted Automata

Generalization Bounds for Learning Weighted Automata Generalization Bounds for Learning Weighted Autoata Borja Balle a,, Mehryar Mohri b,c a Departent of Matheatics and Statistics, Lancaster University, Lancaster, UK b Courant Institute of Matheatical Sciences,

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval Unifor Approxiation and Bernstein Polynoials with Coefficients in the Unit Interval Weiang Qian and Marc D. Riedel Electrical and Coputer Engineering, University of Minnesota 200 Union St. S.E. Minneapolis,

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial Time

Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial Time Learning Maxiu-A-Posteriori Perturbation Models for Structured Prediction in Polynoial Tie Asish Ghoshal Jean Honorio Abstract MAP perturbation odels have eerged as a powerful fraework for inference in

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

SimpleMKL. hal , version 1-26 Jan Abstract. Alain Rakotomamonjy LITIS EA 4108 Université de Rouen Saint Etienne du Rouvray, France

SimpleMKL. hal , version 1-26 Jan Abstract. Alain Rakotomamonjy LITIS EA 4108 Université de Rouen Saint Etienne du Rouvray, France SipleMKL Alain Rakotoaonjy LITIS EA 48 Université de Rouen 768 Saint Etienne du Rouvray, France Francis Bach INRIA - Willow project Départeent d Inforatique, Ecole Norale Supérieure 45, Rue d Ul 7523 Paris,

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information