Collaborative Ranking for Local Preferences: Supplement

Berk Kapicioglu (YP), David S. Rosenberg (YP), Robert E. Schapire (Princeton University), Tony Jebara (Columbia University)

1 Problem Formulation

Let U = {1, . . . , m} be the set of users, let V = {1, . . . , n} be the set of items, and let T = {1, . . . , τ} indicate the local time. Then, the sample space is defined as

    X = {(u, C, i, t) | u ∈ U, C ⊆ V, i ∈ C, t ∈ T}.    (1)

Let P[·] denote probability, let C_{−i} be the set C excluding element i, and let c ∼ U(C) mean that c is sampled uniformly from C. Then, the local ranking loss associated with hypothesis g is

    L_g(u, C, i, t) = P_{c∼U(C_{−i})}[g(u, i, t) − g(u, c, t) ≤ 0].    (2)
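As a concrete illustration of Equation (2), the local ranking loss of a hypothesis g_M(u, i) = M_{u,i} can be computed directly. This is our own sketch, not part of the paper (function and variable names are ours, and the time index is suppressed, as in the analysis below):

```python
def local_ranking_loss(M, u, C, i):
    """Equation (2): the fraction of candidates c in C \\ {i} that the
    hypothesis M ranks at least as high as the preferred item i.
    M is a dense score matrix given as a list of rows: M[u][j]."""
    others = [c for c in C if c != i]
    bad = sum(1 for c in others if M[u][i] - M[u][c] <= 0)
    return bad / len(others)

# A rank-1 hypothesis M = U V^T with U = [[1], [2]] and V = [[1], [0], [3]].
U = [[1], [2]]
V = [[1], [0], [3]]
M = [[sum(Uu[a] * Vj[a] for a in range(len(Vj))) for Vj in V] for Uu in U]
# M = [[1, 0, 3], [2, 0, 6]]

# For user 0, candidate set C = {0, 1, 2}, and preferred item i = 0:
# item 2 outranks item 0 (3 > 1) while item 1 does not (0 < 1).
print(local_ranking_loss(M, u=0, C={0, 1, 2}, i=0))  # 0.5
```

A loss of 0.5 means that half of the competing candidates are ranked above the preferred item; a perfect hypothesis for this example would attain loss 0.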
2 A Bound on the Generalization Error

We assume that the hypothesis class is based on the set of low-rank matrices. Given a low-rank matrix M, let g_M ∈ F be the associated hypothesis, where g_M(u, i) = M_{u,i}. Throughout the paper, we abuse notation and use g_M and M interchangeably. We assume that the data is generated with respect to D, an unknown probability distribution over the sample space X, and we let E denote expectation; we suppress the time index t below to reduce clutter. Then, the generalization error of hypothesis M is E_{(u,C,i)∼D}[L_M(u, C, i)], which is the quantity we bound below.

We derive the generalization bound in two steps. In the first step, we bound the empirical Rademacher complexity of our loss class, defined below, with respect to samples that contain exactly 2 candidates, and in the second step, we prove the generalization bound via a reduction to the first step.

Lemma 1. Let m be the number of users and let n be the number of items. Define L_r = {L_M | M ∈ R^{m×n} has rank at most r} as the class of loss functions associated with low-rank matrices. Assume that S ⊆ X is a set of k samples, each of which contains exactly 2 candidate items; i.e., if (u, C, i) ∈ S, then |C| = 2. Let R_S(L_r) denote the empirical Rademacher complexity of L_r with respect to S. Then,

    R_S(L_r) ≤ sqrt( 2 r(m+n) ln( 16emn² / (r(m+n)) ) / k ).

Proof. Because each sample in S contains exactly 2 candidates, any hypothesis L_M ∈ L_r applied to a sample in S outputs either 0 or 1. Thus, the set of dichotomies that are realized by L_r on S, called Π_{L_r}(S), is well-defined. Using Equation (6) from Boucheron et al. [1], we know that

    R_S(L_r) ≤ sqrt( 2 ln |Π_{L_r}(S)| / k ).

Let X₂ ⊆ X be the set of all samples that contain exactly 2 candidates. Since S ⊆ X₂, we have |Π_{L_r}(S)| ≤ |Π_{L_r}(X₂)|, so it suffices to bound |Π_{L_r}(X₂)|.

We bound |Π_{L_r}(X₂)| by counting the sign configurations of polynomials, using proof techniques that are influenced by Srebro et al. [4]. Let (u, {i, j}, i) ∈ X₂ be a sample and let M be a hypothesis matrix. Because M has rank at most r, it can be written as M = UVᵀ, where U ∈ R^{m×r} and V ∈ R^{n×r}. Let 1[·] denote an indicator function that is 1 if and only if its argument is true. Then, the loss on the sample can be rewritten as

    L_M(u, {i, j}, i) = 1[M_{u,i} − M_{u,j} ≤ 0]
                      = 1[(UVᵀ)_{u,i} − (UVᵀ)_{u,j} ≤ 0]
                      = 1[ Σ_{a=1}^{r} U_{u,a}(V_{i,a} − V_{j,a}) ≤ 0 ].

Since the cardinality of X₂ is at most mn(n−1) ≤ mn², putting it all together, it follows that |Π_{L_r}(X₂)| is bounded by the number of sign configurations of mn² polynomials, each of degree at most 2, over r(m+n) variables. Applying Corollary 3 from Srebro et al. [4], we obtain

    |Π_{L_r}(X₂)| ≤ ( 16emn² / (r(m+n)) )^{r(m+n)}.

Taking logarithms and making basic substitutions yields the desired result. ∎

We proceed to proving the more general result via a reduction to Lemma 1.

Theorem 1. Let m be the number of users and let n be the number of items. Assume that S consists of k independently and identically distributed samples chosen from X with respect to a probability distribution D. Let L_M be the loss function associated with a matrix M, as defined in Equation (2). Then, with probability at least 1 − δ, for any matrix M ∈ R^{m×n} with rank at most r,

    E_{(u,C,i)∼D}[L_M(u, C, i)] ≤ E_{(u,C,i)∼U(S)}[L_M(u, C, i)]
        + 2 sqrt( 2 r(m+n) ln( 16emn² / (r(m+n)) ) / k ) + sqrt( 2 ln(1/δ) / k ).    (3)

Proof. We manipulate the definition of Rademacher complexity [1] in order to use the bound given in Lemma 1. Below, S′ denotes the two-candidate sample set {(u_a, {i_a, j_a}, i_a)}_{a=1}^{k}:

    R_S(L_r) = E_σ[ sup_{L_M∈L_r} (1/k) Σ_{a=1}^{k} σ_a L_M(u_a, C_a, i_a) ]
             = E_σ[ sup_{L_M∈L_r} (1/k) Σ_{a=1}^{k} σ_a E_{j_a∼U(C_a∖{i_a})}[ L_M(u_a, {i_a, j_a}, i_a) ] ]
             = E_σ[ sup_{L_M∈L_r} E_{j_1,…,j_k}[ (1/k) Σ_{a=1}^{k} σ_a L_M(u_a, {i_a, j_a}, i_a) ] ]
             ≤ E_σ[ E_{j_1,…,j_k}[ sup_{L_M∈L_r} (1/k) Σ_{a=1}^{k} σ_a L_M(u_a, {i_a, j_a}, i_a) ] ]
             = E_{j_1,…,j_k}[ E_σ[ sup_{L_M∈L_r} (1/k) Σ_{a=1}^{k} σ_a L_M(u_a, {i_a, j_a}, i_a) ] ]
             = E_{j_1,…,j_k}[ R_{S′}(L_r) ]
             ≤ sqrt( 2 r(m+n) ln( 16emn² / (r(m+n)) ) / k ),

where the second equality holds because L_M(u, C, i) = E_{j∼U(C_{−i})}[L_M(u, {i, j}, i)], the inequality follows from Jensen's inequality, and the last step applies Lemma 1 to S′. Plugging this bound into Theorem 3 in Boucheron et al. [1] proves the theorem. ∎
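To see how the right-hand side of Equation (3) behaves, the following sketch (our own illustration, assuming the 16emn² constant inside the logarithm as written above) evaluates the complexity and confidence terms and checks their O(1/√k) decay:

```python
from math import e, log, sqrt

def rademacher_bound(m, n, r, k):
    """Lemma 1's bound on the empirical Rademacher complexity for k
    two-candidate samples, rank constraint r, m users, and n items."""
    q = r * (m + n)  # number of entries in the factors U and V
    return sqrt(2 * q * log(16 * e * m * n ** 2 / q) / k)

def generalization_gap(m, n, r, k, delta):
    """The two terms added to the empirical loss in Equation (3)."""
    return 2 * rademacher_bound(m, n, r, k) + sqrt(2 * log(1 / delta) / k)

# Both terms scale as 1/sqrt(k), so quadrupling the sample size
# halves the gap between the empirical and the true loss.
g1 = generalization_gap(m=1000, n=500, r=10, k=10 ** 4, delta=0.05)
g2 = generalization_gap(m=1000, n=500, r=10, k=4 * 10 ** 4, delta=0.05)
print(round(g1 / g2, 6))  # 2.0
```

With m, n, r, and δ fixed, the gap depends on the sample size only through the 1/√k factor, which is what the printed ratio demonstrates.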
3 Collaborative Local Ranking

Let h(x) = max(0, 1 − x) be the hinge function, let M be the hypothesis matrix with rank at most r, and let M = UVᵀ, where U ∈ R^{m×r} and V ∈ R^{n×r}. Then, we can bound the empirical local ranking loss as

    E_{(u,C,i)∼U(S)}[L_M(u, C, i)]
        = (1/|S|) Σ_{(u,C,i)∈S} P_{c∼U(C_{−i})}[M_{u,i} − M_{u,c} ≤ 0]
        = (1/|S|) Σ_{(u,C,i)∈S} E_{c∼U(C_{−i})}[ 1[M_{u,i} − M_{u,c} ≤ 0] ]
        = (1/|S|) Σ_{(u,C,i)∈S} (1/|C_{−i}|) Σ_{c∈C_{−i}} 1[(UVᵀ)_{u,i} − (UVᵀ)_{u,c} ≤ 0]
        ≤ (1/|S|) Σ_{(u,C,i)∈S} (1/|C_{−i}|) Σ_{c∈C_{−i}} h((UVᵀ)_{u,i} − (UVᵀ)_{u,c}).    (4)

We note that the CLR and the ranking SVM [2] objectives are closely related. If V is fixed and we only need to minimize over U, then each row of V acts as a feature vector for the corresponding item, each row of U acts as a separate linear predictor, and the CLR objective decomposes into solving m simultaneous ranking SVM problems. In particular, let S_u = {(a, C, i) ∈ S | a = u} be the examples that correspond to user u, let U_u denote row u of U, and let f_rsvm denote the objective function of ranking SVM; then

    f_CLR(S; U, V) = (λ/2)‖U‖_F² + (1/|S|) Σ_{(u,C,i)∈S} (1/|C_{−i}|) Σ_{c∈C_{−i}} h((UVᵀ)_{u,i} − (UVᵀ)_{u,c})
                   = Σ_{u=1}^{m} [ (λ/2)‖U_u‖² + (1/|S|) Σ_{(u,C,i)∈S_u} (1/|C_{−i}|) Σ_{c∈C_{−i}} h(⟨U_u, V_i − V_c⟩) ]
                   = Σ_{u=1}^{m} f_rsvm(S_u; U_u, V).

4 Algorithms

4.1 Derivation

Let (u, C, i) ∈ S be an example; then the corresponding approximate objective function for the V-subproblem is

    f̃_CLR((u, C, i); U, V) = (λ/2)‖V‖_F² + (1/|C_{−i}|) Σ_{c∈C_{−i}} h((UVᵀ)_{u,i} − (UVᵀ)_{u,c}).
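The inequality in Equation (4) — the hinge surrogate dominates the 0–1 local ranking loss — is easy to check numerically. Below is a small self-contained sketch (our own, with hypothetical scores and examples):

```python
def hinge(x):
    """h(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)

def empirical_loss_and_surrogate(M, S):
    """Average local ranking loss over S, and its hinge upper bound
    from Equation (4). Each example in S is a triple (u, C, i)."""
    loss = bound = 0.0
    for u, C, i in S:
        others = [c for c in C if c != i]
        loss += sum(M[u][i] - M[u][c] <= 0 for c in others) / len(others)
        bound += sum(hinge(M[u][i] - M[u][c]) for c in others) / len(others)
    return loss / len(S), bound / len(S)

M = [[1.0, 0.0, 3.0], [2.0, 0.0, 6.0]]  # scores of a rank-1 hypothesis
S = [(0, {0, 1, 2}, 0), (1, {0, 2}, 2), (1, {1, 2}, 1)]
loss, surrogate = empirical_loss_and_surrogate(M, S)
print(loss <= surrogate)  # True: the hinge dominates the 0-1 loss
```

Since h(x) ≥ 1[x ≤ 0] pointwise, the surrogate is an upper bound for every choice of M and S, not just this one; minimizing the convex surrogate is what makes the algorithms of the next section tractable.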
We introduce some matrix notation to help us define the approximate subgradients. Given a matrix M, let M_ℓ denote row ℓ of M.
Define the matrix M̂^{p,q,z}, for p ≠ q, as

    M̂^{p,q,z}_{s,·} = { M_{z,·}    for s = p,
                       { −M_{z,·}   for s = q,    (5)
                       { 0          otherwise,

and define the matrix M̌^{p,q,z} as

    M̌^{p,q,z}_{s,·} = { M_{p,·} − M_{q,·}   for s = z,    (6)
                       { 0                  otherwise.

Let 1[·] denote an indicator function that is 1 if and only if its argument is true. Then, the subgradient of the approximate objective function with respect to V is

    ∇_V f̃_CLR((u, C, i); U, V) = λV − (1/|C_{−i}|) Σ_{c∈C_{−i}} 1[(UVᵀ)_{u,i} − (UVᵀ)_{u,c} < 1] Û^{i,c,u}.    (7)

Setting η_t = 1/(λt) as the learning rate at iteration t, the approximate subgradient update becomes V_{t+1} ← V_t − η_t ∇_V f̃_CLR((u, C, i); U, V_t). After the update, the weights are projected onto a ball with radius 1/√λ. The pseudocode for optimizing both convex subproblems is depicted in Algorithms 2 and 3. We prove the correctness of the algorithms and bound their running time in the next subsection.

Algorithm 1 Alternating minimization for optimizing the CLR objective
Input: training data S ⊆ X, regularization parameter λ > 0, rank constraint r, number of iterations T
1: U_0 ← matrix sampled uniformly at random from [−1/√(mr), 1/√(mr)]^{m×r}
2: V_0 ← matrix sampled uniformly at random from [−1/√(nr), 1/√(nr)]^{n×r}
3: for all t from 1 to T do
4:   U_t ← argmin_U f_CLR(S; U, V_{t−1})
5:   V_t ← argmin_V f_CLR(S; U_t, V)
6: return U_T, V_T

Algorithm 2 Projected stochastic subgradient descent for optimizing U
Input: factors V ∈ R^{n×r}, training data S, regularization parameter λ, rank constraint r, number of iterations T
1: U_1 ← 0^{m×r}
2: for all t from 1 to T do
3:   Choose (u, C, i) ∈ S uniformly at random
4:   η_t ← 1/(λt)
5:   C⁺ ← {c ∈ C_{−i} | (U_t Vᵀ)_{u,i} − (U_t Vᵀ)_{u,c} < 1}
6:   U_{t+1} ← (1 − η_t λ) U_t + (η_t/|C_{−i}|) Σ_{c∈C⁺} V̌^{i,c,u}
7:   U_{t+1} ← min{1, (1/√λ)/‖U_{t+1}‖_F} U_{t+1}
8: return U_{T+1}

Algorithm 3 Projected stochastic subgradient descent for optimizing V
Input: factors U ∈ R^{m×r}, training data S, regularization parameter λ, rank constraint r, number of iterations T
1: V_1 ← 0^{n×r}
2: for all t from 1 to T do
3:   Choose (u, C, i) ∈ S uniformly at random
4:   η_t ← 1/(λt)
5:   C⁺ ← {c ∈ C_{−i} | (U V_tᵀ)_{u,i} − (U V_tᵀ)_{u,c} < 1}
6:   V_{t+1} ← (1 − η_t λ) V_t + (η_t/|C_{−i}|) Σ_{c∈C⁺} Û^{i,c,u}
7:   V_{t+1} ← min{1, (1/√λ)/‖V_{t+1}‖_F} V_{t+1}
8: return V_{T+1}
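Algorithm 3 is compact enough to transcribe directly. The sketch below (pure Python, our own naming; the matrix Û^{i,c,u} is applied implicitly rather than materialized) follows steps 3–7 and, like the pseudocode, keeps the iterate inside the ball of radius 1/√λ:

```python
import math, random

def sgd_for_V(U, n, S, lam, T, seed=0):
    """Algorithm 3 sketch: projected stochastic subgradient descent on V.
    U is an m x r list of rows, S a list of (u, C, i) examples."""
    rng = random.Random(seed)
    r = len(U[0])
    V = [[0.0] * r for _ in range(n)]  # step 1: V = 0^{n x r}

    def score(u, j):
        return sum(U[u][a] * V[j][a] for a in range(r))

    for t in range(1, T + 1):
        u, C, i = S[rng.randrange(len(S))]       # step 3
        eta = 1.0 / (lam * t)                    # step 4: eta_t = 1/(lam t)
        others = [c for c in C if c != i]
        active = [c for c in others if score(u, i) - score(u, c) < 1]  # step 5
        for row in V:                            # step 6: shrink by (1 - eta*lam)
            for a in range(r):
                row[a] *= 1.0 - eta * lam
        w = eta / len(others)
        for c in active:                         # step 6: add the Uhat^{i,c,u} terms
            for a in range(r):
                V[i][a] += w * U[u][a]           # row i of Uhat^{i,c,u} is U_u
                V[c][a] -= w * U[u][a]           # row c of Uhat^{i,c,u} is -U_u
        norm = math.sqrt(sum(x * x for row in V for x in row))
        radius = 1.0 / math.sqrt(lam)            # step 7: project onto the ball
        if norm > radius:
            V = [[x * radius / norm for x in row] for row in V]
    return V

U = [[1.0, 0.0], [0.0, 1.0]]
S = [(0, {0, 1, 2}, 0), (1, {1, 2}, 2)]
V = sgd_for_V(U, n=3, S=S, lam=0.5, T=200)
print(math.sqrt(sum(x * x for row in V for x in row)) <= 1 / math.sqrt(0.5) + 1e-9)  # True
```

Algorithm 2 is symmetric: the roles of U and V are swapped and the active-set update adds rows V_i − V_c (the matrix V̌^{i,c,u}) to row u of U.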
4.2 Analysis

The convex subproblems we analyze have the general form

    min_{X∈D} f(X; ℓ) = min_{X∈D} [ (λ/2)‖X‖_F² + (1/|S|) Σ_{(u,C,i)∈S} ℓ(X; (u, C, i)) ].    (8)

One can obtain the individual subproblems by specifying the domain D and the loss function ℓ. For example, in the case of Algorithm 2, the corresponding minimization problem is specified by

    min_{X∈R^{m×r}} f(X; ℓ_V),    (9)

where ℓ_V(X; (u, C, i)) = (1/|C_{−i}|) Σ_{c∈C_{−i}} h((XVᵀ)_{u,i} − (XVᵀ)_{u,c}), and in the case of Algorithm 3, it is specified by

    min_{X∈R^{n×r}} f(X; ℓ_U),    (10)

where ℓ_U(X; (u, C, i)) = (1/|C_{−i}|) Σ_{c∈C_{−i}} h((UXᵀ)_{u,i} − (UXᵀ)_{u,c}).

Let U* = argmin_U f(U; ℓ_V) and V* = argmin_V f(V; ℓ_U) denote the solution matrices of Equations (9) and (10), respectively. Also, given a general convex loss ℓ and domain D, we call X̂ ∈ D an ε-accurate solution for the corresponding minimization problem if f(X̂; ℓ) ≤ min_{X∈D} f(X; ℓ) + ε.
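The analysis below relies on the loss functions ℓ_V and ℓ_U being convex in their matrix argument. A quick numerical illustration (our own sketch, with hypothetical data) checks the midpoint inequality for the general-form objective of Equation (8) instantiated with ℓ_V:

```python
import random

def hinge(x):
    return max(0.0, 1.0 - x)

def f_general(X, V, S, lam):
    """Equation (8)/(9): (lam/2)||X||_F^2 plus the averaged hinge
    losses, with X playing the role of U and V held fixed."""
    r = len(V[0])
    reg = 0.5 * lam * sum(x * x for row in X for x in row)
    total = 0.0
    for u, C, i in S:
        others = [c for c in C if c != i]
        s = lambda j: sum(X[u][a] * V[j][a] for a in range(r))
        total += sum(hinge(s(i) - s(c)) for c in others) / len(others)
    return reg + total / len(S)

rng = random.Random(1)
V = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
S = [(0, {0, 1, 2}, 0), (1, {1, 2}, 2)]
X1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
X2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
mid = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(X1, X2)]
lhs = f_general(mid, V, S, 0.1)
rhs = (f_general(X1, V, S, 0.1) + f_general(X2, V, S, 0.1)) / 2
print(lhs <= rhs + 1e-12)  # True: the objective is convex along the segment
```

The check always passes: each hinge term is a convex function of a linear map of X, and sums, averages, and the squared Frobenius norm preserve convexity, which is exactly the argument used in the proof of Lemma 4 below.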
In the remainder of this subsection, we show that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm to the CLR setting. Then, we prove certain properties that are prerequisites for obtaining Pegasos's performance guarantees. In particular, we show that the approximate subgradients computed by Algorithms 2 and 3 are bounded and that the loss functions associated with Equations (9) and (10) are convex. In the end, we plug these properties into a theorem proved by Shalev-Shwartz et al. [3] to show that our algorithms reach an ε-accurate solution with respect to their corresponding minimization problems in Õ(1/(λε)) iterations.

Lemma 2. ‖U*‖_F ≤ 1/√λ and ‖V*‖_F ≤ 1/√λ.

Proof. One can obtain the bounds on the norms of the optimal solutions by examining the dual forms of the optimization problems and applying the strong duality theorem. Equations (9) and (10) can both be represented as

    min_{v∈D} (λ/2)‖v‖² + Σ_{ℓ=1}^{k} e_ℓ h(f_ℓ(v)),    (11)

where each e_ℓ > 0 is a constant that absorbs the 1/(|S||C_{−i}|) factors, h is the hinge function, D is a Euclidean space, and each f_ℓ is a linear function. We rewrite Equation (11) as a constrained optimization problem:

    min_{v∈D, ξ∈R^k} (λ/2)‖v‖² + Σ_{ℓ=1}^{k} e_ℓ ξ_ℓ    (12)
    subject to ξ_ℓ ≥ 1 − f_ℓ(v), ℓ = 1, …, k,
               ξ_ℓ ≥ 0, ℓ = 1, …, k.

The Lagrangian of this problem is

    L(v, ξ, α, β) = (λ/2)‖v‖² + Σ_ℓ e_ℓ ξ_ℓ + Σ_ℓ α_ℓ (1 − f_ℓ(v) − ξ_ℓ) − Σ_ℓ β_ℓ ξ_ℓ
                  = (λ/2)‖v‖² + Σ_ℓ ξ_ℓ (e_ℓ − α_ℓ − β_ℓ) + Σ_ℓ α_ℓ (1 − f_ℓ(v)),

and its dual function is g(α, β) = inf_{v,ξ} L(v, ξ, α, β). Since L(v, ξ, α, β) is convex and differentiable with respect to v and ξ, the necessary and sufficient conditions for minimizing over v and ξ are

    ∇_v L = 0  ⟹  v = (1/λ) Σ_ℓ α_ℓ ∇_v f_ℓ(v),
    ∇_ξ L = 0  ⟹  e_ℓ − α_ℓ − β_ℓ = 0.    (13)

We plug these conditions back into the dual function and obtain

    g(α, β) = inf_{v,ξ} L(v, ξ, α, β) = (λ/2)‖v‖² + Σ_ℓ α_ℓ − Σ_ℓ α_ℓ f_ℓ(v),    (14)

where v satisfies Equation (13). Since f_ℓ is a linear function, we let f_ℓ(v) = ⟨w_ℓ, v⟩, where w_ℓ is a constant vector, and ∇_v f_ℓ(v) = w_ℓ. Then, Equation (13) gives v = (1/λ) Σ_ℓ α_ℓ w_ℓ, and

    Σ_ℓ α_ℓ f_ℓ(v) = ⟨ Σ_ℓ α_ℓ w_ℓ, v ⟩ = λ‖v‖².    (15)

Simplifying Equation (14) using Equation (15) yields

    g(α, β) = Σ_ℓ α_ℓ − (λ/2)‖v‖².    (16)

Finally, we combine Equations (13) and (16), and obtain the dual form of Equation (12):

    max_α Σ_ℓ α_ℓ − (1/(2λ)) ‖Σ_ℓ α_ℓ w_ℓ‖²    (17)
    subject to 0 ≤ α_ℓ ≤ e_ℓ, ℓ = 1, …, k.
The primal problem is convex, its constraints are linear, and the domain of its objective is open; thus, Slater's condition holds and strong duality obtains. Furthermore, the primal problem has differentiable objective and constraint functions, which implies that (v*, ξ*) is primal optimal and (α*, β*) is dual optimal if and only if these points satisfy the Karush-Kuhn-Tucker (KKT) conditions. It follows that

    v* = (1/λ) Σ_ℓ α*_ℓ w_ℓ.    (18)

Note that we defined the constants e_ℓ so that Σ_ℓ e_ℓ = 1, and the constraints of the dual problem imply 0 ≤ α_ℓ ≤ e_ℓ; thus, Σ_ℓ α*_ℓ ≤ 1. Because of strong duality, there is no duality gap, and the primal and dual objectives are equal at the optimum:

    (λ/2)‖v*‖² + Σ_ℓ e_ℓ ξ*_ℓ = Σ_ℓ α*_ℓ − (λ/2)‖v*‖²    (by (18))
    ⟹ λ‖v*‖² ≤ Σ_ℓ α*_ℓ ≤ 1
    ⟹ ‖v*‖ ≤ 1/√λ.

This proves the lemma. ∎

Given the bounds in Lemma 2, it can be verified that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm for optimizing Equations (9) and (10), respectively. It still remains to show that Pegasos's performance guarantees hold in our case.

Lemma 3. In Algorithms 2 and 3, the approximate subgradients have norm at most √λ + 2/√λ.

Proof. The approximate subgradient for Algorithm 3 is depicted in Equation (7). Due to the projection step, ‖V_t‖_F ≤ 1/√λ, and it follows that ‖λV_t‖_F ≤ √λ. The term Û^{i,c,u} is constructed using Equation (5), and it can be verified that ‖Û^{i,c,u}‖_F ≤ √2 ‖U‖_F ≤ √2/√λ. Using the triangle inequality, one can bound Equation (7) with √λ + √2/√λ. A similar argument can be made for the approximate subgradient of Algorithm 2, where ‖V̌^{i,c,u}‖_F = ‖V_i − V_c‖ ≤ 2‖V‖_F ≤ 2/√λ, yielding the slightly higher upper bound given in the lemma statement. ∎

We combine the lemmas to obtain the correctness and running-time guarantees for our algorithms.

Lemma 4. Let λ ≤ 1/4, let T be the total number of iterations of Algorithm 2, and let U_t denote the parameter computed by the algorithm at iteration t. Let Ū = (1/T) Σ_{t=1}^{T} U_t denote the average of the parameters produced by the algorithm. Then, with probability at least 1 − δ,

    f(Ū; ℓ_V) ≤ f(U*; ℓ_V) + (21 G² ln(T/δ)) / (λT),

where G = √λ + 2/√λ is the subgradient bound from Lemma 3. The analogous result holds for Algorithm 3 as well.

Proof. First, for each loss function ℓ_V and ℓ_U, variables are linearly combined, composed with the convex hinge function, and then averaged. All these operations preserve
convexity; hence, both loss functions are convex. Second, we have argued above that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm for optimizing Equations (9) and (10), respectively. Third, in Lemma 3, we proved a bound on the approximate subgradients of both algorithms. Plugging these three results into Corollary 2 in Shalev-Shwartz et al. [3] yields the statement of the lemma. ∎

The theorem below gives a bound in terms of individual parameters rather than average parameters.

Theorem 2. Assume that the conditions and the bound in Lemma 4 hold. Let t be an iteration index selected uniformly at random from {1, …, T}. Then, with probability at least 1/4,

    f(U_t; ℓ_V) ≤ f(U*; ℓ_V) + (42 G² ln(T/δ)) / (λT).

The analogous result holds for Algorithm 3 as well.

Proof. The result follows directly from combining Lemma 4 with Lemma 3 in Shalev-Shwartz et al. [3]. ∎

Thus, with high probability, our algorithms reach an ε-accurate solution in Õ(1/(λε)) iterations. Since we argued in Subsection 4.1 that the running time of each stochastic update is O(br), it follows that a complete run of projected stochastic subgradient descent takes Õ(br/(λε)) time, and the running time is independent of the size of the training data.

References

[1] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[2] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133–142, New York, NY, USA, 2002. ACM.

[3] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, March 2011.

[4] Nathan Srebro, Noga Alon, and Tommi Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In NIPS, 2004.