Max-margin structured output learning in L1 norm space

Sandor Szedmak, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom, ss03v@ecs.soton.ac.uk
Yizhao Ni, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom, yn05r@ecs.soton.ac.uk
Craig J. Saunders, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom, cjs@ecs.soton.ac.uk
Juho Rousu, Department of Computer Science, University of Helsinki, Finland, juho.rousu@cs.helsinki.fi

(All authors would like to acknowledge the support of the EU Framework 6 project SMART.)

Abstract

We study a structured output learning setting where both the sample size and the dimensions of the input and output feature vectors are very large (possibly infinite in the latter case), but the input and output feature representations are nonnegative and very sparse (i.e. the number of nonzero components is finite and their proportion to the dimension is close to zero). Such situations are encountered in real-world problems such as statistical machine translation. We show that in this setting structured output learning can be implemented efficiently. The solution relies on maximum margin learning of the linear relations between the inputs and outputs in an L1 norm space. This learning problem can be formulated by imposing L∞ norm regularisation on the linear transformation expressing the relations.

1 Introduction

Machine learning researchers have devoted relatively little effort to discovering how margin based learning methods behave in L1 space. Several papers investigate the case when the L1 norm is applied in the regularisation, e.g. Linear Programming Boosting [DBST02] or, under the name lasso, [Tib96], or/and in measuring the loss, e.g. the SVM with soft margin; see for example [CST00]. This paper focuses on applications where, instead of L1 regularisation, the margin is measured in an L1 sense. It will be shown that this kind of learning displays characteristic properties allowing very large scale problems to be solved with moderate computational effort.

As in machine learning, in approximation theory too surprisingly little attention has been given to the L1 norm space. The book [Pin89] summarises some results that are good starting points for further research activities. Another valuable source is [DG85], which deals with density estimation in L1 space. The authors of the latter book emphasise an important fact: measuring the distance between two density functions using the L1 norm is invariant under monotone transformations of the coordinate axes. In other words, if only the order of the coordinates is preserved but the scales are changing, then the L1 norm based distances of vectors normalised to 1 in the same norm remain the same. This allows us to use a nonparametric approach when the underlying distributions might be irregular, e.g. they have no expected value. Human language is an example application field showing the symptoms of irregularity, which has motivated us in formulating the presented approach.

In the following we first formulate the supervised learning problem for structured outputs, then present the optimisation framework. We then show that the base problem can be solved in a very simple way, which leads us to an online algorithm. Using a perceptron type interpretation we are able to state Novikoff-style bounds for the new algorithm.

2 General Setting

We are given a sample of pairs of input and output objects {(x_i, y_i)}, i = 1,...,m, taken from the sets X and Y independently with respect to an unknown distribution defined on X × Y.
Furthermore, there exist two functions φ and ψ which map the input and output objects into linear vector spaces, namely φ : X → L_φ and ψ : Y → L_ψ, where L_φ and L_ψ are linear vector spaces whose elements represent the input and output objects. The task is to find a linear transformation W which gives a good predictor of the outputs, represented in the corresponding linear vector space, from the feature vectors of the inputs: ψ(y) ≈ Wφ(x).

One concrete example of this style of problem is a machine translation task where both the input and output objects are sentences taken from natural languages, e.g. English and French, and are represented by occurrences of phrases: n-grams, substrings of words with a special structure, etc. Thus, the input and output feature vectors φ(x_i) and ψ(y_i) have very high dimensions but are very sparse. Many other applications studied in the structured prediction literature [TGK03, TLJJ06, TJHA05, RSSST06], however, also fit naturally into this framework.

3 Optimisation problem

We are going to express the relations between the input and the output via a linear transformation projecting the input feature vectors into the space of the output feature vectors, which could be an optimum solution of the following maximum margin problem:

\min_{W} r(W) \quad \text{w.r.t. } W : L_\varphi \to L_\psi \text{ linear operator}, \quad \text{s.t. } \langle \psi(y_i), W\varphi(x_i)\rangle_{L_\psi} \ge b, \; i = 1,\dots,m, \quad b > 0 \text{ a given constant.}

The objective function r(·) is assumed to be a regularisation function; its concrete definition is derived later. The constraints force the inner products between the output feature vectors and the images of the input feature vectors under the linear operator W to be sufficiently and uniformly large. We use the inner product in a rather algebraic sense, ⟨u, v⟩ = Σ_{k=1}^{n} u_k v_k for u, v ∈ R^n, instead of the geometric one which assumes a Hilbert space in the background. The constraints can be rewritten to express, instead of a regression task, a one-class classification task in the joint feature space of inputs and outputs, namely

\langle \psi(y_i), W\varphi(x_i)\rangle_{L_\psi} = \mathrm{tr}\big(\psi(y_i)^{\top} W\varphi(x_i)\big) = \mathrm{tr}\big(W\varphi(x_i)\psi(y_i)^{\top}\big) = \big\langle W, \, \psi(y_i)\otimes\varphi(x_i)\big\rangle_{L_\psi\otimes L_\varphi},   (1)

where tr(·) denotes the trace of the matrix in its argument and the operator ⊗ marks the tensor product between its operands. Since the tensor product of two linear vector spaces is a linear vector space too, we can interpret z_i = ψ(y_i) ⊗ φ(x_i) as vectors, and the linear operator W becomes a linear functional, thus a vector in the dual space of the space spanned by the vectors {z_i}, i = 1,...,m; hence we can use the vector notation w as well. Based on (1) we arrive at a problem coinciding in form with the standard one-class SVM classification problem:

\min_{w} r(w) \quad \text{s.t. } z_i^{\top} w \ge b, \; i = 1,\dots,m, \quad b > 0 \text{ a given constant}, \; z_i \in \mathbb{R}^{n_z}.   (2)

The well-known approaches to regularisation apply the L2 norm [Vap98], e.g. the Support Vector Machine, or the L1 norm, e.g. Linear Programming Boosting [DBST02]. The use of structured outputs, or of feature spaces on the output side, has recently been studied using the standard L2 norm based regularisation, see e.g. [SSTPH05] and [AHP+08]. However, the case when the maximum margin is measured in the L1 norm has rarely been investigated.

We are going to focus on the regularisation function r(·) which maximises the L1 norm based distance between the separating hyperplane and the origin in the one-class problem. To this end a subproblem can be formulated which computes this distance, measured between the origin and the closest point of the hyperplane:

\min_{u} \|u\|_1 \quad \text{s.t. } w^{\top} u = b,   (3)

saying that a vector u sitting on the hyperplane is sought with minimum L1 norm. The entire problem, which maximises this minimum distance, takes the following form:

\max_{w} f(w, b) = \Big[\min_{u} \|u\|_1 \;\; \text{s.t. } w^{\top} u = b\Big] \quad \text{s.t. } z_i^{\top} w \ge b, \; i = 1,\dots,m, \quad b > 0 \text{ a given constant.}   (4)

In the sequel we deal with the subcase of (4) in which the non-negativity conditions z_i ≥ 0, i = 1,...,m, hold.
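
The joint vectors z_i are only m in number, but they live in a space of dimension n_z = dim(L_ψ)·dim(L_φ), so in practice they are only ever materialised sparsely. The following minimal sketch is not part of the paper; it assumes SciPy sparse row vectors and toy feature dimensions, and shows how z_i = ψ(y_i) ⊗ φ(x_i) can be built as a flattened Kronecker product whose number of nonzeros is the product of the two (small) nonzero counts:

```python
import scipy.sparse as sp

def joint_feature(phi_x, psi_y):
    """Joint feature vector z = psi(y) (x) phi(x) as a flat sparse row vector.

    phi_x : 1 x n_phi sparse row vector of input features (nonnegative, sparse)
    psi_y : 1 x n_psi sparse row vector of output features (nonnegative, sparse)
    """
    # The Kronecker product of the two row vectors is the flattened tensor product;
    # its number of nonzeros is nnz(phi_x) * nnz(psi_y), not n_phi * n_psi.
    return sp.kron(psi_y, phi_x, format="csr")

# toy illustration: 5 input features, 4 output features, a few nonzeros each
phi_x = sp.csr_matrix(([0.5, 0.5], ([0, 0], [1, 3])), shape=(1, 5))
psi_y = sp.csr_matrix(([1.0], ([0], [2])), shape=(1, 4))
z = joint_feature(phi_x, psi_y)
print(z.shape, z.nnz)   # (1, 20) 2
```
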
4 Optimum solution

Let us first solve the subproblem (3) for a fixed w. Via a simple argument one can show that the optimum value of (3) can be given in closed form.

Proposition 1. The optimum value of problem (3) is equal to

\|u^*\|_1 = \frac{b}{\max_j |w_j|} = \frac{b}{\|w\|_\infty}.   (5)

Proof: First, for the sake of simplicity, we divide both sides of the equality constraint by b; since b is strictly positive, this has no effect on the problem. Let us denote w/b by w_b. We can assume that at least one component of w_b differs from 0, otherwise no feasible solution exists. Now we unfold the norm in the objective by applying the substitution u = u^+ − u^−, u^+ ≥ 0, u^− ≥ 0, and write up the dual problem as well:

\text{primal: } \min_{u^+, u^-} \mathbf{1}^{\top}(u^+ + u^-) \;\; \text{s.t. } w_b^{\top}(u^+ - u^-) = 1,\; u^+ \ge 0,\; u^- \ge 0; \qquad \text{dual: } \max_{\gamma}\ \gamma \;\; \text{s.t. } \gamma w_b \le \mathbf{1},\; -\gamma w_b \le \mathbf{1}.   (6)

For any strictly positive component of w_b we have

\gamma (w_b)_j \le 1 \;\Rightarrow\; \gamma \le \min_{j:\,(w_b)_j > 0} \frac{1}{(w_b)_j},   (7)

and for any strictly negative component of the same vector the following holds:

-\gamma (w_b)_j \le 1 \;\Rightarrow\; \gamma \le \min_{j:\,(w_b)_j < 0} \frac{1}{|(w_b)_j|},   (8)

therefore

\gamma^* = \min_j \frac{1}{|(w_b)_j|} = \frac{1}{\max_j |(w_b)_j|} = \frac{1}{\|w_b\|_\infty}.   (9)

Since the primal objective has a lower bound, namely 0, the dual has a feasible bounded optimal solution and the optimum dual value equals the optimum primal value. Therefore ‖u*‖_1 = γ* = 1/‖w_b‖_∞ = b/‖w‖_∞, which is the statement of the proposition.
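
Proposition 1 is easy to sanity-check numerically. The sketch below is not from the paper; it solves (3) as a linear program with scipy.optimize.linprog, using the same u = u^+ − u^− splitting as in the proof, and compares the LP optimum with the closed form b/‖w‖_∞. The random vector w and the constant b are arbitrary illustration values:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
w, b = rng.normal(size=6), 2.0           # arbitrary illustration values

# solve (3):  min ||u||_1  s.t.  w^T u = b,  via the splitting u = u_plus - u_minus
n = w.size
c = np.ones(2 * n)                       # objective 1^T u_plus + 1^T u_minus
A_eq = np.concatenate([w, -w])[None, :]  # single equality constraint w^T (u+ - u-) = b
res = linprog(c, A_eq=A_eq, b_eq=[b], bounds=[(0, None)] * (2 * n))

print(res.fun)                 # LP optimum value
print(b / np.abs(w).max())     # closed form b / ||w||_inf from Proposition 1 -- they agree
```
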

Based on Proposition 1 we have

\max_{w} f(w, b) = \frac{b}{\|w\|_\infty} \quad \text{s.t. } z_i^{\top} w \ge b, \; i = 1,\dots,m, \quad b > 0 \text{ a given constant},

and after reformulating it as a minimisation problem we obtain

\min_{w} \|w\|_\infty \quad \text{s.t. } z_i^{\top} w \ge b, \; i = 1,\dots,m, \quad b > 0 \text{ a given constant.}   (10)

Proposition 2. If z_i ≥ 0 for all i = 1,...,m, then an optimum solution of the linear programming problem (10) is

w^* = \frac{b}{\|z_{i^*}\|_1}\,\mathbf{1}, \quad \text{where } i^* = \arg\min_i \|z_i\|_1.   (11)

Proof: First, both sides of the inequality constraints are divided by b. Let us use the notation w_b for w/b. We obtain

\min_{w_b} \|w_b\|_\infty \quad \text{s.t. } z_i^{\top} w_b \ge 1, \; i = 1,\dots,m, \quad z_i \ge 0, \; i = 1,\dots,m.   (12)

We can recognise that w_b^* = \mathbf{1}/\|z_{i^*}\|_1 is a feasible solution, since

z_i^{\top} w_b^* = \frac{z_i^{\top}\mathbf{1}}{\|z_{i^*}\|_1} = \frac{\|z_i\|_1}{\|z_{i^*}\|_1} = \frac{\|z_i\|_1}{\min_i \|z_i\|_1} \ge 1.   (13)

Now we need to prove that w_b^* is also an optimum solution. If it is not, then we can find a ŵ which is feasible and ‖ŵ‖_∞ < ‖w_b^*‖_∞. This means that there is a constant β such that ŵ_j ≤ β < 1/‖z_{i*}‖_1 for every j = 1,...,n_z. From z_i ≥ 0, i = 1,...,m, it follows that β > 0, otherwise the feasibility assumption is immediately violated. Let us check the feasibility of ŵ:

z_{i^*}^{\top}\hat{w} \le \beta\, z_{i^*}^{\top}\mathbf{1} = \beta\,\|z_{i^*}\|_1 < 1,   (14)

hence ŵ violates the constraint belonging to z_{i*}, the sample item with the smallest L1 norm. Thus w_b^* is an optimum solution of (10) (up to the rescaling by b).

With a constant, completely flat optimum solution, the predictor for a new φ(x) ≥ 0 can be written as

(\tilde{\psi}(y))_j = (W\varphi(x))_j = \frac{b\,\langle \varphi(x), \mathbf{1}\rangle}{\|z_{i^*}\|_1},

which might not seem interesting at first sight; however, let us now consider the sparse case in the next subsection.

4.1 Sparse case

First we define what we mean by sparseness in problem (10). Sparseness means that there is at least one index j ∈ {1,...,n_z} such that (z_i)_j = 0 holds for all i. A consequence of this kind of sparseness is that the corresponding component of w has no influence on feasibility; thus this component is not determined except for the upper and lower bounds imposed by the objective function min ‖w‖_∞. Hence the optimum solutions of (10) form a set containing the elements obeying the form

(w^*)_j = \begin{cases} d & \text{if } (z_i)_j > 0 \text{ for some } i = 1,\dots,m, \\ \in [-d, d] & \text{otherwise,} \end{cases} \qquad \text{where } d = \frac{b}{\min_i \|z_i\|_1}.   (15)

Because z_i = ψ(y_i) ⊗ φ(x_i), the component (z_i)_j equals 0 whenever the corresponding component of either ψ(y_i) or φ(x_i), or of both, equals 0, which has high probability if both terms in the tensor product have many zero elements. In the sparse case we can impose a further optimisation on our base problem (10), which minimises the number of non-zero components of w:

\min_{w} \|w\|_0 \quad \text{s.t. } w \in W^*, \qquad \text{where } W^* = \arg\min_{w} \big\{\, \|w\|_\infty : z_i^{\top} w \ge b, \; i = 1,\dots,m, \; b > 0 \text{ a given constant} \,\big\},   (16)

where the norm ‖w‖_0 denotes the number of non-zero components of the vector in its argument. This extension gives us the following optimum solution:

(w^*)_j = \begin{cases} d & \text{if } (z_i)_j > 0 \text{ for some } i = 1,\dots,m, \\ 0 & \text{otherwise,} \end{cases}   (17)

where d is defined as before.

4.2 Scale independency

If the prediction ψ(ỹ) = Wφ(x) depends only on the shape and not on the scale, we can employ the particular normalised solution

(w^*)_j = \begin{cases} 1 & \text{if } (z_i)_j > 0 \text{ for some } i = 1,\dots,m, \\ 0 & \text{otherwise,} \end{cases}   (18)

instead of the original one given by (15).
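
The closed-form sparse solution (17) requires only one pass over the data. The sketch below is not from the paper; it assumes the joint vectors z_i ≥ 0 are stacked as the rows of a SciPy CSR matrix, and the helper name flat_sparse_solution is hypothetical:

```python
import numpy as np
import scipy.sparse as sp

def flat_sparse_solution(Z, b=1.0):
    """Closed-form solution (17) of problem (10) for nonnegative, sparse data.

    Z : (m x n_z) scipy.sparse CSR matrix whose rows are the joint vectors z_i >= 0.
    b : the margin constant.
    Returns w with w_j = d on every component that is nonzero in at least one
    sample and w_j = 0 elsewhere, where d = b / min_i ||z_i||_1.
    """
    row_l1 = np.asarray(Z.sum(axis=1)).ravel()               # ||z_i||_1, since z_i >= 0
    d = b / row_l1.min()
    support = np.asarray((Z != 0).sum(axis=0)).ravel() > 0   # j with (z_i)_j > 0 for some i
    w = np.zeros(Z.shape[1])
    w[support] = d
    return w
```

Replacing d by 1 in the last assignment gives the scale independent solution (18).
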

5 Kernelization

In the L1 norm we cannot apply the same kernelization that is straightforwardly implemented in the L2 norm case via the dual form of the optimisation problem. A simple approach to this issue is presented here. Let us consider the following embedding Φ : Z → R^m, where the components of Φ(z_i) are defined by

(\Phi(z_i))_k = \langle z_k, z_i\rangle = \big\langle \psi(y_k)\otimes\varphi(x_k),\, \psi(y_i)\otimes\varphi(x_i)\big\rangle = \langle\varphi(x_k),\varphi(x_i)\rangle\,\langle\psi(y_k),\psi(y_i)\rangle = \kappa_\varphi(x_k, x_i)\,\kappa_\psi(y_k, y_i),   (19)

where κ_φ and κ_ψ are the kernel functions. Changing the margin constraints in (4) from

z_i^{\top} w \ge b, \quad i = 1,\dots,m,   (20)

into

\Phi(z_i)^{\top} w_\Phi \ge b, \quad i = 1,\dots,m,   (21)

we obtain the same structure, thus the solution of the modified problem follows the same pattern. One can recognise that all we did is nothing more than expressing the linear operator projecting the input space into the output space as

W = \sum_{k=1}^{m} (w_\Phi)_k\, \psi(y_k)\otimes\varphi(x_k).   (22)

A similar kernelization is applied, for example, in [Man00] to the binary Support Vector Machine.
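
In matrix form, (19) says that the Gram matrix of the joint vectors is the elementwise product of the input and output kernel matrices. The sketch below is not from the paper; it assumes the two m x m kernel matrices on the training sample are already available:

```python
import numpy as np

def joint_kernel_embedding(K_phi, K_psi):
    """The embedding (19): (Phi(z_i))_k = kappa_phi(x_k, x_i) * kappa_psi(y_k, y_i).

    K_phi, K_psi : (m x m) kernel matrices of the inputs and of the outputs on the
    training sample.  Their elementwise (Hadamard) product is the Gram matrix of the
    joint vectors z_i; its i-th column is Phi(z_i).
    """
    return np.asarray(K_phi) * np.asarray(K_psi)

# the kernelized margin constraints (21) then read:  K_joint[:, i] @ w_Phi >= b
```
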
6 Prediction

The prediction for an arbitrary x can be given by y = ψ^{-1}(Wφ(x)), but to give substance and some interpretation to this formula we need to make some additional remarks. The product Wφ(x) gives a score vector for every input vector, but this score vector lives in the linear vector space representing the output; it is not the output object itself. For the complete solution we need to find an inverse image, which is not always straightforward. The inversion can be carried out if we know the structure of the output space and how it is represented. The first step, the computation of the scores ψ(y), is demonstrated in Figure 1. The prediction can be derived by taking the optimum solution of (18). To find the optimal output y for an input x we need to establish a model. First, to invert the function ψ, the following optimisation schema can be applied:

\hat{y} = \arg\max_{y} \big\langle W, \, \psi(y)\otimes\varphi(x)\big\rangle_{L_\psi\otimes L_\varphi} \quad \text{s.t. } y \in \hat{Y},   (23)

where Ŷ is the set of possible outputs. The next element of our inversion model is the assumption that the vectors ψ(y), y ∈ Ŷ, are indicator vectors of the possible patterns that should appear in the output, so that we are looking for a vector with a relatively small number of non-zeros pointing to the patterns, with all other components set to 0. (This is explicit in the machine translation example mentioned earlier: vectors are indexed by the entire set of possible phrases/features in a language, but elements are only non-zero for those that actually occur in a given sentence.) Our solution approach can exploit the fact that the objective function of (23) is linear in ψ(y) and can be written as d^⊤ψ(y), where d = Wφ(x). However, the maximum of a linear function can be finite only if the components of the vector ψ(y) are bounded component-wise or/and in norm. To this end let us consider the following problem:

\max_{u}\ d^{\top} u \quad \text{s.t. } \|u\|_1 = 1, \; 0 \le u \le C,   (24)

which has a simple optimum solution. To derive it, we first sort the components of the vector d into decreasing order,

(d_1, \dots, d_{n_d}) \to (d_{i_1}, d_{i_2}, \dots, d_{i_{n_d}}),   (25)

and let K be the smallest integer greater than 1/C; then an optimum is given by

u_{i_k} = \begin{cases} C & \text{if } k < K, \\ 1 - C(K-1) & \text{if } k = K, \\ 0 & \text{otherwise.} \end{cases}   (26)

Now the task can be stated as finding an optimal bound C which approximates the number of non-zero items in the indicator we are looking for. Let us consider the following general two-level problem, where the inner part follows (24):

\max_{t = 1, 2, \dots}\; g(t)\,\Big[\max_{u_t}\; d^{\top} u_t \quad \text{s.t. } \|u_t\|_1 = 1, \; 0 \le u_t \le \tfrac{1}{g(t)}\Big],   (27)

where t runs over the positive integers and g is a real-valued, monotone increasing function. What is expressed here is that as t increases we choose first the component of d with the highest value, then the two highest ones, and so on; we might stop at the first local minimum of the outer problem. The function g relates to the speed of the decay of the decreasingly ordered components of the vector d and is thus problem dependent. Possible choices are g(t) = log(t) or g(t) = √t; the linear case g(t) = t is obviously wrong, since for a decreasing sequence it attains its optimum only at the largest admissible t. Further research is needed to estimate a good candidate. The prediction is then derived from the optimum solution u_t of (27), interpreting it as an indicator vector. We note that the number of non-zero components in this optimum solution will be proportional to the number of non-zero components of the input, but this connection is not a direct one.
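
The inner problem (24) needs nothing more than a sort of the score vector. The sketch below is not from the paper; the helper name topk_indicator is hypothetical, d stands for the score vector Wφ(x), and scanning t = 1, 2, ... with a shrinking bound C = 1/g(t) reproduces the two-level scheme (27):

```python
import numpy as np

def topk_indicator(d, C):
    """Optimal u of problem (24):  max d^T u  s.t.  ||u||_1 = 1,  0 <= u <= C."""
    order = np.argsort(d)[::-1]             # components of d in decreasing order, as in (25)
    K = int(np.ceil(1.0 / C))               # smallest integer >= 1/C
    u = np.zeros_like(d, dtype=float)
    u[order[:K - 1]] = C                    # full weight C on the K-1 largest scores
    u[order[K - 1]] = 1.0 - C * (K - 1)     # remaining mass on the K-th one, as in (26)
    return u

d = np.array([0.9, 0.1, 0.6, 0.3])
print(topk_indicator(d, C=0.5))             # [0.5, 0.0, 0.5, 0.0]
```

The nonzero positions of the returned u indicate the predicted output patterns.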

Figure 1: The prediction schema. The column (row) vectors are indicators of candidate patterns; the operator W maps the input vector φ(x) to the output scores ψ(y) = Wφ(x).

7 An online framework: a set based perceptron learner

The motivation for the online approach stems from the structure of the solution in the sparse case, which, in turn, can be interpreted as a method for learning the possible co-occurrences of parts of the input and output vectors via an indicator vector.

First we outline the problem that we are going to solve. Let us assume that the input and output objects are represented in the following way. Given two, supposedly finite, sets Ω_x and Ω_y, the collections of the possible patterns characterising the objects x and y, every observed x and y is described by functions µ_x : Ω_x → R_+ and µ_y : Ω_y → R_+, where R_+ denotes the nonnegative real numbers. They may be measures of the importance of the patterns. Assume that Σ_{ω_x ∈ Ω_x} µ_x(ω_x) = 1 and Σ_{ω_y ∈ Ω_y} µ_y(ω_y) = 1, i.e. the measures are normalised in the L1 norm; thus µ_x and µ_y might be interpreted as probabilities. Let φ(x) = (µ_x(ω_x))_{ω_x ∈ Ω_x} and ψ(y) = (µ_y(ω_y))_{ω_y ∈ Ω_y} be the vectors of the weights of the patterns. From these conditions we can derive Ω_z = Ω_x × Ω_y and z = φ(x) ⊗ ψ(y), a tensor product which, as a consequence, is normalised to 1 as well.

Based on this model we can formulate the next learning task, which is to find w, a linear functional of the space {0,1}^{Ω_z} corresponding to a subset W ⊆ Ω_z, such that ⟨w, z_i⟩ ≥ λ, where λ > 0 is a given constant and i = 1,...,m are the indices of the sample items. By the definition of the inner product,

\langle w, z_i\rangle = \sum_{(\omega_x, \omega_y) \in \mathcal{W}} \mu_{x_i}(\omega_x)\,\mu_{y_i}(\omega_y);

thus, what we are looking for is a set W which comprises a sufficiently large number of the most important common patterns of all sample items. We can associate sets, represented by indicators, with all the sample items z_i = ψ(y_i) ⊗ φ(x_i) and with the linear functional w as well:

z_i \to [z_i > 0] = \{j : (z_i)_j > 0\} = Z_i, \qquad w \to [w > 0] = \{j : w_j > 0\} = \mathcal{W}.   (28)

Since all the nonzero components of w are equal to a constant, we can restore w from its set based representation via a bijective mapping between the vector and the set representation. We can write up a perceptron type algorithm, see details in [CST00], for solving problems following the schema of the L1 norm based maximum margin learner; it is given by Algorithm 1.

Algorithm 1: Primal perceptron for sets
    Input of the learner: the sample S
    Output of the learner: w ∈ R^{dim(L_ψ)·dim(L_φ)}, represented by the set W
    Initialisation: t = 0; W_0 = ∅; noupdate = true
    repeat
        noupdate = true
        for i = 1, 2, ..., m do
            read input: z_i = ψ(y_i) ⊗ φ(x_i)
            if ⟨w_t, z_i⟩ < λ then
                w_t → W_t, z_i → Z_i      {switch to the set representation (28)}
                W_{t+1} = W_t ∪ Z_i       {set update}
                W_{t+1} → w_{t+1}         {back to the vector representation}
                t = t + 1
                noupdate = false
            end if
        end for
    until noupdate is true
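
The sketch below is not from the paper; it runs Algorithm 1 directly on the set representation (28), assuming each sample item z_i is stored as a dictionary of its nonzero pattern weights (so that only the support is ever touched):

```python
def set_perceptron(samples, lam):
    """A minimal sketch of Algorithm 1 on the set representation (28).

    samples : list of dicts; each dict maps a pattern pair (omega_x, omega_y) to its
              weight mu_x(omega_x) * mu_y(omega_y) and plays the role of z_i
              (only the nonzero entries are stored).
    lam     : the margin parameter lambda, 0 < lam < 1.
    Returns the learned index set W, i.e. the support of w.
    Assumes each z_i is L1-normalised to 1, as in the paper, so the loop terminates.
    """
    W = set()
    updated = True
    while updated:                            # "repeat ... until noupdate is true"
        updated = False
        for z in samples:
            # <w_t, z_i> with a unit constant on W is the weight of W intersected with Z_i
            score = sum(v for pattern, v in z.items() if pattern in W)
            if score < lam:                   # margin violated: set update  W <- W U Z_i
                W |= set(z)                   # Z_i is the key set of z
                updated = True
    return W
```
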
A Novikoff-style bound [Nov62] can be stated for this algorithm. Based on the definition of w, for any two realisations w_k and w_l we have ⟨w_k, w_l⟩ = |W_k ∩ W_l|, and ‖w‖_2^2 = ⟨w, w⟩ = |W|. We can define a margin for the perceptron learner by

\gamma(w, S) = \min_{(y_i, x_i) \in S} \frac{\langle w, z_i\rangle}{\|w\|_2}, \qquad z_i = \psi(y_i)\otimes\varphi(x_i).   (29)

Let δ_i(λ, w) be such that ⟨w, z_i⟩ ≥ λ implies |W ∩ Z_i| ≥ δ_i(λ, w). Such a δ_i(λ, w) > 1/l_min has to exist as a consequence of the definition of the inner product, the nonnegativity and the normalisation of the sample items. Now we consider the minimum of them, i.e. δ(λ, w) = min_{i=1,...,m} δ_i(λ, w). The non-negativity guarantees that δ(λ, w) is a monotonically increasing function of its two variables, so greater λ or a larger set W implies greater δ.

Theorem 3. Let S = {(x_i, y_i)} ⊂ Y × X, i = 1,...,m, be a sample independently and identically drawn from an unknown distribution, and let φ : X → L_φ and ψ : Y → L_ψ be mappings into spaces of tuples of indicators. Let z_i = ψ(y_i) ⊗ φ(x_i) be the tensor product and Z_i the indicator set of the nonzero items of z_i. Assume that 0 < λ < 1 and 0 < l_min ≤ |Z_i| ≤ l_max, i = 1,...,m, i.e. the support (the number of patterns) of every sample item falls within a given range. Furthermore, assume there is a w* such that Algorithm 1 stops with no more updates and ⟨w*, z_i⟩ ≥ λ, i = 1,...,m. Then

1. the number of updates in Algorithm 1 is bounded by

t \le \frac{l_{\max}}{\Delta^2},   (30)

2. the margin at the solution w_t has a lower bound

\gamma(w_t, S) \ge \frac{\lambda\,\Delta}{l_{\max}},   (31)

where Δ = δ(λ, w*)/‖w*‖_2.

Proof: We follow the main thread of Novikoff's reasoning, but with some extensions. Because w* satisfies all the margin constraints, for any t we have W* ⊇ W_t, and thus δ(λ, W*) ≥ δ(λ, W_t). Let us use the short notation δ(λ) = δ(λ, W*). After t update steps the squared L2 norm of w_t can be bounded by

\|w_t\|_2^2 = |W_t| \le l_{\max}\, t.   (32)

Now consider the following inner product:

\langle w^*, w_t\rangle = |W^* \cap W_t| = |W^* \cap (W_{t-1}\cup Z_{i_t})| = |(W^*\cap W_{t-1}) \cup (W^*\cap Z_{i_t})|   (33)

= \Big|W^*\cap \textstyle\bigcup_{\tau=1}^{t} Z_{i_\tau}\Big| \ge \delta(\lambda)\, t \quad \text{(by induction on t)},   (34)

since for any t, W_t ⊆ W* holds. Merging the inequalities above we obtain

\sqrt{l_{\max}\, t}\;\|w^*\|_2 \ge \|w^*\|_2\,\|w_t\|_2 \ge \langle w^*, w_t\rangle \ge \delta(\lambda)\, t.   (35)

This gives an upper bound on the number of updates as a function of the functional margin:

t \le \frac{l_{\max}}{\Delta^2}.   (36)

Substituting this inequality into (32) we have

\|w_t\|_2 \le \frac{l_{\max}}{\Delta},   (37)

and finally the lower bound on the functional margin

\gamma(w_t, S) \ge \frac{\lambda\,\Delta}{l_{\max}}   (38)

can be obtained, which completes the proof.

Remark 4. One can recognise behind this scenario a special variant of the weighted set covering problem. The sample items found as errors by the perceptron algorithm give a cover of all the patterns occurring in the sample. It is obviously not an optimum cover, i.e. the smallest in cardinality among all possible ones, but a sufficiently good one. It is an open question how to extend the range of applications of machine learning algorithms of this kind to produce approximations for hard combinatorial problems. This relationship allows us to find some connections between L1 norm based learning and the Set Covering Machine introduced in [MST02].

8 Discussion

We have shown in this paper that measuring the margin by the L1 norm in a maximum margin learning problem gives a simple solution to an otherwise hardly tractable class of structural learning tasks.

References

[AHP+08] K. Astikainen, L. Holm, E. Pitkänen, S. Szedmak, and J. Rousu. Towards structured output prediction of enzyme function. BMC Proceedings, 2(Suppl 4):S2, 2008.

[CST00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[DBST02] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225-254, 2002.

[DG85] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley, New York, 1985.

[Man00] O. L. Mangasarian. Generalized support vector machines. In Advances in Large Margin Classifiers, pages 135-146. MIT Press, 2000.

[MST02] M. Marchand and J. Shawe-Taylor. The set covering machine. Journal of Machine Learning Research, 3, 2002.

[Nov62] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24-26. Polytechnic Institute of Brooklyn, 1962.

[Pin89] A. M. Pinkus. On L1-Approximation. Cambridge University Press, 1989.

[RSSST06] J. Rousu, C. J. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, Special issue on Machine Learning and Large Scale Optimization, 2006.

[SSTPH05] S. Szedmak, J. Shawe-Taylor, and E. Parado-Hernandez. Learning via linear operators: Maximum margin regression. PASCAL Research Reports, http://eprints.pascal-network.org/, 2005.

[TGK03] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 2003, 2003.

[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
[TJHA05] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.

[TLJJ06] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. JMLR, Special Topic on Machine Learning and Optimization, pages 1627-1653, 2006.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, 1998.