Collaborative Ranking for Local Preferences Supplement

Similar documents
Collaborative Ranking for Local Preferences

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Least-Squares Regression on Sparse Spaces

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A Course in Machine Learning

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

7.1 Support Vector Machine

PDE Notes, Lecture #11

Euler equations for multiple integrals

On the Generalization Ability of Online Strongly Convex Programming Algorithms

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms

Connections Between Duality in Control Theory and

On the Cauchy Problem for Von Neumann-Landau Wave Equation

A LIMIT THEOREM FOR RANDOM FIELDS WITH A SINGULARITY IN THE SPECTRUM

Convergence of Random Walks

Machine Learning Lecture 6 Note

Design Consideration in Material Selection Design Sensitivity INTRODUCTION

Lower bounds on Locality Sensitive Hashing

Mini-Batch Primal and Dual Methods for SVMs

Homework 2 Solutions EM, Mixture Models, PCA, Duality

All's Well That Ends Well: Supplementary Proofs

Proof of SPNs as Mixture of Trees

ALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS

II. First variation of functionals

Math 342 Partial Differential Equations «Viktor Grigoryan

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Adaptive Gain-Scheduled H Control of Linear Parameter-Varying Systems with Time-Delayed Elements

u!i = a T u = 0. Then S satisfies

Dissipative numerical methods for the Hunter-Saxton equation

A Unified Theorem on SDP Rank Reduction

PETER L. BARTLETT AND MARTEN H. WEGKAMP

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Function Spaces. 1 Hilbert Spaces

Discrete Operators in Canonical Domains

3.7 Implicit Differentiation -- A Brief Introduction -- Student Notes

Applications of the Wronskian to ordinary linear differential equations

Calculus and optimization

Structural Risk Minimization over Data-Dependent Hierarchies

Lecture 10: October 30, 2017

Introduction to the Vlasov-Poisson system

Iterated Point-Line Configurations Grow Doubly-Exponentially

On colour-blind distinguishing colour pallets in regular graphs

A Dual-Augmented Block Minimization Framework for Learning with Limited Memory

Self-normalized Martingale Tail Inequality

3.6. Let's write out the sample space for this random experiment:

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Chaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena

LECTURE NOTES ON DVORETZKY'S THEOREM

Discrete Mathematics

Table of Common Derivatives By David Abraham

Logarithmic spurious regressions

The Subtree Size Profile of Plane-oriented Recursive Trees

On combinatorial approaches to compressed sensing

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

Inverse Time Dependency in Convex Regularized Learning

Calculus of Variations

arxiv: v2 [cs.ds] 11 May 2016

Sublinear Optimization for Machine Learning

Two formulas for the Euler ϕ-function

An extension of Alexandrov s theorem on second derivatives of convex functions

26.1 Metropolis method

Math 300 Winter 2011 Advanced Boundary Value Problems I. Bessel's Equation and Bessel Functions

Conservation Laws. Chapter Conservation of Energy

Optimal A Priori Discretization Error Bounds for Geodesic Finite Elements

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Monte Carlo Methods with Reduced Error

LOCAL WELL-POSEDNESS OF NONLINEAR DISPERSIVE EQUATIONS ON MODULATION SPACES

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v1 [cs.lg] 22 Mar 2014

A New Converse Bound for Coded Caching

High-Dimensional p-norms

Stochastic Gradient Descent with Only One Projection

SVD-free Convex-Concave Approaches for Nuclear Norm Regularization

A New Minimum Description Length

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lecture 2: Correlated Topic Model

Tractability results for weighted Banach spaces of smooth functions

How to Minimize Maximum Regret in Repeated Decision-Making

Step 1. Analytic Properties of the Riemann zeta function [2 lectures]

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Math 180, Exam 2, Fall 2012 Problem 1 Solution. (a) The derivative is computed using the Chain Rule twice.

Exponential asymptotic property of a parallel repairable system with warm standby under common-cause failure

MATH 566, Final Project Alexandra Tcheng,

Monotonicity for excited random walk in high dimensions

Math 1B, lecture 8: Integration by parts

The Principle of Least Action

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Online Learning with Partial Feedback. 1 Online Mirror Descent with Estimated Gradient

Chapter 9 Method of Weighted Residuals

Branch differences and Lambert W

arxiv: v4 [cs.ds] 7 Mar 2014

A Weak First Digit Law for a Class of Sequences

Transcription:

Collaborative Ranking for Local Preferences: Supplement

Berk Kapicioglu (YP), David S. Rosenberg (YP), Robert E. Schapire (Princeton University), Tony Jebara (Columbia University)

1 Problem Formulation

Let $U = \{1, \dots, m\}$ be the set of users, let $V = \{1, \dots, n\}$ be the set of items, and let $T = \{1, \dots, T\}$ indicate the local time. Then, the sample space is defined as
$$X = \{(u, C, i, t) \mid u \in U,\ C \subseteq V,\ i \in C,\ t \in T\}. \tag{1}$$
Let $P[\cdot]$ denote probability, let $C_{-i}$ be the set $C$ excluding element $i$, and let $c \sim U(C)$ denote that $c$ is sampled uniformly from $C$. Then, the local ranking loss associated with hypothesis $g$ is
$$L_g(u, C, i, t) = P_{c \sim U(C_{-i})}\left[g(u, i, t) - g(u, c, t) \le 0\right]. \tag{2}$$

2 A Bound on the Generalization Error

We assume that the hypothesis class is based on the set of low-rank matrices. Given a low-rank matrix $M$, let $g_M \in F$ be the associated hypothesis, where $g_M(u, i) = M_{u,i}$. Throughout the paper, we abuse notation and use $g_M$ and $M$ interchangeably. We assume that data is generated with respect to $D$, which is an unknown probability distribution over the sample space $X$, and we let $\mathbb{E}$ denote expectation. Then, the generalization error of hypothesis $M$ is $\mathbb{E}_{(u,C,i) \sim D}\left[L_M(u, C, i)\right]$, which is the quantity we bound below.

We will derive the generalization bound in two steps. In the first step, we will bound the empirical Rademacher complexity of our loss class, defined below, with respect to samples that contain exactly 2 candidates, and in the second step, we will prove the generalization bound with a reduction to the previous step.

Lemma 1. Let $m$ be the number of users and let $n$ be the number of items. Define $\mathcal{L}_r = \{L_M \mid M \in \mathbb{R}^{m \times n} \text{ has rank at most } r\}$ as the class of loss functions associated with low-rank matrices. Assume that $S_2 \subseteq X$ is a set of $k$ samples, where each sample contains exactly 2 candidate items; i.e., if $(u, C, i) \in S_2$, then $|C| = 2$. Let $\mathcal{R}_{S_2}(\mathcal{L}_r)$ denote the Rademacher complexity of $\mathcal{L}_r$ with respect to $S_2$. Then,
$$\mathcal{R}_{S_2}(\mathcal{L}_r) \le \sqrt{\frac{2 r (m + n) \ln \frac{16 e m n^2}{r (m + n)}}{k}}.$$

Proof. Because each sample in $S_2$ contains exactly 2 candidates, any hypothesis $L_M \in \mathcal{L}_r$ applied to a sample in $S_2$ outputs either 0 or 1. Thus, the set of dichotomies that are realized by $\mathcal{L}_r$ on $S_2$, called $\Pi_{\mathcal{L}_r}(S_2)$, is well-defined. Using Equation (6) from Boucheron et al. [1], we know that
$$\mathcal{R}_{S_2}(\mathcal{L}_r) \le \sqrt{\frac{2 \ln |\Pi_{\mathcal{L}_r}(S_2)|}{k}}.$$
Let $X_2 \subseteq X$ be the set of all samples that contain exactly 2 candidates. Since $\Pi_{\mathcal{L}_r}(S_2) \subseteq \Pi_{\mathcal{L}_r}(X_2)$, it suffices to bound $|\Pi_{\mathcal{L}_r}(X_2)|$. We bound $|\Pi_{\mathcal{L}_r}(X_2)|$ by counting the sign configurations of polynomials, using proof techniques that are influenced by Srebro et al. [4].

Let $(u, \{i, j\}, i) \in X_2$ be a sample and let $M$ be a hypothesis matrix. Because $M$ has rank at most $r$, it can be written as $M = U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. Let $\mathbb{1}[\cdot]$ denote an indicator function that is 1 if and only if its argument is true. Then, the loss on the sample can be rewritten as
$$L_M(u, \{i, j\}, i) = \mathbb{1}\left[M_{u,i} - M_{u,j} \le 0\right] = \mathbb{1}\left[(U V^\top)_{u,i} - (U V^\top)_{u,j} \le 0\right] = \mathbb{1}\left[\sum_{a=1}^{r} U_{u,a} \left(V_{i,a} - V_{j,a}\right) \le 0\right].$$
Since the cardinality of $X_2$ is at most $m \binom{n}{2} \le m n^2$, putting it all together, it follows that $|\Pi_{\mathcal{L}_r}(X_2)|$ is bounded by the number of sign configurations of $m n^2$ polynomials, each of degree at most 2, over $r(m+n)$ variables. Applying Corollary 3 from Srebro et al. [4], we obtain
$$|\Pi_{\mathcal{L}_r}(X_2)| \le \left(\frac{16 e m n^2}{r (m + n)}\right)^{r (m + n)}.$$
Taking logarithms and making basic substitutions yield the desired result.
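
The local ranking loss in Equation (2) is simply the fraction of competing candidates that the hypothesis fails to rank strictly below the selected item. The short NumPy sketch below is an added illustration, not code from the paper; the function name and arguments are invented. It evaluates the loss for a low-rank hypothesis $M = U V^\top$.

```python
# Minimal sketch (added illustration, not from the paper) of the local ranking
# loss in Equation (2) for a low-rank hypothesis M = U V^T.
import numpy as np

def local_ranking_loss(U, V, u, cand, i):
    """Fraction of candidates c in C_{-i} ranked at least as high as item i,
    i.e. P_{c ~ U(C_{-i})}[M_{u,i} - M_{u,c} <= 0]."""
    M_u = U[u] @ V.T                       # row u of M = U V^T
    others = [c for c in cand if c != i]   # C_{-i}; assumes |C| >= 2
    return np.mean([M_u[i] - M_u[c] <= 0 for c in others])

# Toy usage: 3 users, 4 items, rank 2 (all values are made up).
rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(local_ranking_loss(U, V, u=0, cand={0, 1, 3}, i=1))
```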

We proceed to proving the more general result via a reduction to Lemma 1.

Theorem 1. Let $m$ be the number of users and let $n$ be the number of items. Assume that $S$ consists of $k$ independently and identically distributed samples chosen from $X$ with respect to a probability distribution $D$. Let $L_M$ be the loss function associated with a matrix $M$, as defined in Equation 2. Then, with probability at least $1 - \delta$, for any matrix $M \in \mathbb{R}^{m \times n}$ with rank at most $r$,
$$\mathbb{E}_{(u,C,i) \sim D}\left[L_M(u, C, i)\right] \le \mathbb{E}_{(u,C,i) \sim S}\left[L_M(u, C, i)\right] + 2\sqrt{\frac{2 r (m + n) \ln \frac{16 e m n^2}{r (m + n)}}{k}} + \sqrt{\frac{2 \ln \frac{1}{\delta}}{k}}. \tag{3}$$

Proof. We will manipulate the definition of Rademacher complexity [1] in order to use the bound given in Lemma 1:
$$\begin{aligned}
\mathcal{R}_S(\mathcal{L}_r) &= \mathbb{E}_\sigma\left[\sup_{L_M \in \mathcal{L}_r} \frac{1}{k} \sum_{a=1}^{k} \sigma_a L_M(u_a, C_a, i_a)\right] \\
&= \mathbb{E}_\sigma\left[\sup_{L_M \in \mathcal{L}_r} \frac{1}{k} \sum_{a=1}^{k} \sigma_a\, \mathbb{E}_{j_a \sim U(C_a \setminus \{i_a\})}\left[L_M(u_a, \{i_a, j_a\}, i_a)\right]\right] \\
&= \mathbb{E}_\sigma\left[\sup_{L_M \in \mathcal{L}_r} \mathbb{E}_{j_1, \dots, j_k}\left[\frac{1}{k} \sum_{a=1}^{k} \sigma_a L_M(u_a, \{i_a, j_a\}, i_a)\right]\right] \\
&\le \mathbb{E}_\sigma\, \mathbb{E}_{j_1, \dots, j_k}\left[\sup_{L_M \in \mathcal{L}_r} \frac{1}{k} \sum_{a=1}^{k} \sigma_a L_M(u_a, \{i_a, j_a\}, i_a)\right] \\
&= \mathbb{E}_{j_1, \dots, j_k}\, \mathbb{E}_\sigma\left[\sup_{L_M \in \mathcal{L}_r} \frac{1}{k} \sum_{a=1}^{k} \sigma_a L_M(u_a, \{i_a, j_a\}, i_a)\right] \\
&= \mathbb{E}_{j_1, \dots, j_k}\left[\mathcal{R}_{S_2}(\mathcal{L}_r)\right] \le \sqrt{\frac{2 r (m + n) \ln \frac{16 e m n^2}{r (m + n)}}{k}},
\end{aligned}$$
where $S_2$ denotes the induced sample set $\{(u_a, \{i_a, j_a\}, i_a)\}_{a=1}^{k}$, each of whose samples contains exactly 2 candidates. Plugging the bound into Theorem 3 in Boucheron et al. [1] proves the theorem.

3 Collaborative Local Ranking

Let $h(x) = \max(0, 1 - x)$ be the hinge function, let $M$ be the hypothesis matrix with rank at most $r$, and let $M = U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. Then, we can bound the empirical local ranking loss as
$$\begin{aligned}
\mathbb{E}_{(u,C,i) \sim S}\left[L_M(u, C, i)\right]
&= \frac{1}{|S|} \sum_{(u,C,i) \in S} P_{c \sim U(C_{-i})}\left[M_{u,i} - M_{u,c} \le 0\right] \\
&= \frac{1}{|S|} \sum_{(u,C,i) \in S} \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} \mathbb{1}\left[(U V^\top)_{u,i} - (U V^\top)_{u,c} \le 0\right] \\
&\le \frac{1}{|S|} \sum_{(u,C,i) \in S} \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((U V^\top)_{u,i} - (U V^\top)_{u,c}\right).
\end{aligned} \tag{4}$$

We note that the CLR and the ranking SVM [2] objectives are closely related. If $V$ is fixed and we only need to minimize over $U$, then each row of $V$ acts as a feature vector for the corresponding item, each row of $U$ acts as a separate linear predictor, and the CLR objective decomposes into solving $m$ simultaneous ranking SVM problems. In particular, let $S_u = \{(a, C, i) \in S \mid a = u\}$ be the examples that correspond to user $u$, let $U_{u,\cdot}$ denote row $u$ of $U$, and let $f_{\mathrm{rsvm}}$ denote the objective function of ranking SVM; then
$$\begin{aligned}
f_{\mathrm{CLR}}(S; U, V) &= \frac{\lambda}{2} \|U\|_F^2 + \frac{1}{|S|} \sum_{(u,C,i) \in S} \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((U V^\top)_{u,i} - (U V^\top)_{u,c}\right) \\
&= \frac{1}{m} \sum_{u=1}^{m} \left[\frac{\lambda m}{2} \|U_{u,\cdot}\|^2 + \frac{m}{|S|} \sum_{(u,C,i) \in S_u} \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((U V^\top)_{u,i} - (U V^\top)_{u,c}\right)\right] \\
&= \frac{1}{m} \sum_{u=1}^{m} f_{\mathrm{rsvm}}(S_u; U_{u,\cdot}, V).
\end{aligned}$$

4 Algorithms

Algorithm 1 Alternating minimization for optimizing the CLR objective
Input: training data $S \subseteq X$, regularization parameter $\lambda > 0$, rank constraint $r$, number of iterations $T$
1: $U_0 \leftarrow$ matrix sampled uniformly at random from $\left[-\frac{1}{\sqrt{\lambda m r}}, \frac{1}{\sqrt{\lambda m r}}\right]^{m \times r}$
2: $V_0 \leftarrow$ matrix sampled uniformly at random from $\left[-\frac{1}{\sqrt{\lambda n r}}, \frac{1}{\sqrt{\lambda n r}}\right]^{n \times r}$
3: for all $t$ from 1 to $T$ do
4:   $U_t \leftarrow \arg\min_U f_{\mathrm{CLR}}(S; U, V_{t-1})$
5:   $V_t \leftarrow \arg\min_V f_{\mathrm{CLR}}(S; U_t, V)$
6: return $U_T, V_T$

4.1 Derivation

Let $(u, C, i) \in S$ be an example; then the corresponding approximate objective function is
$$f_{\mathrm{CLR}}((u, C, i); U, V) = \frac{\lambda}{2} \|V\|_F^2 + \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((U V^\top)_{u,i} - (U V^\top)_{u,c}\right).$$
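
For concreteness, the snippet below is an added sketch (not the authors' code; names are invented) that evaluates the hinge surrogate of Equation (4) for a single example, i.e., the loss part of the per-example objective above. The full objective simply averages these terms over $S$ and adds the regularizer.

```python
# Sketch (added illustration) of the hinge surrogate used in Equation (4) and
# in the single-example objective of Subsection 4.1:
#   f_CLR((u,C,i); U, V) = (lam/2)*||V||_F^2
#                          + (1/|C_{-i}|) * sum_c h((UV^T)_{u,i} - (UV^T)_{u,c}).
import numpy as np

def hinge(x):
    return np.maximum(0.0, 1.0 - x)

def clr_example_objective(U, V, u, cand, i, lam):
    others = [c for c in cand if c != i]          # C_{-i}; assumes |C| >= 2
    margins = U[u] @ (V[i] - V[others]).T         # (UV^T)_{u,i} - (UV^T)_{u,c}
    return 0.5 * lam * np.sum(V ** 2) + np.mean(hinge(margins))

# Toy usage with made-up factors.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(clr_example_objective(U, V, u=0, cand=[0, 2, 3], i=2, lam=0.1))
```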

We introduce various matrix notation to help us define the approximate subgradients. Given a matrix $M$, let $M_{s,\cdot}$ denote row $s$ of $M$. Define the matrix $\hat{M}^{p,q,z}$, for $p \ne q$, as
$$\hat{M}^{p,q,z}_{s,\cdot} = \begin{cases} M_{z,\cdot} & \text{for } s = p, \\ -M_{z,\cdot} & \text{for } s = q, \\ 0 & \text{otherwise}, \end{cases} \tag{5}$$
and define the matrix $\check{M}^{p,q,z}$ as
$$\check{M}^{p,q,z}_{s,\cdot} = \begin{cases} M_{p,\cdot} - M_{q,\cdot} & \text{for } s = z, \\ 0 & \text{otherwise}. \end{cases} \tag{6}$$

Let $\mathbb{1}[\cdot]$ denote an indicator function that is 1 if and only if its argument is true. Then, the subgradient of the approximate objective function with respect to $V$ is
$$\nabla_V f_{\mathrm{CLR}}((u, C, i); U, V) = \lambda V - \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} \mathbb{1}\left[(U V^\top)_{u,i} - (U V^\top)_{u,c} < 1\right] \hat{U}^{i,c,u}. \tag{7}$$
Setting $\eta_t = \frac{1}{\lambda t}$ as the learning rate at iteration $t$, the approximate subgradient update becomes $V_t \leftarrow V_{t-1} - \eta_t \nabla_V f_{\mathrm{CLR}}((u, C, i); U, V_{t-1})$. After the update, the weights are projected onto a ball with radius $\frac{1}{\sqrt{\lambda}}$. The pseudocode for optimizing both convex subproblems is depicted in Algorithms 2 and 3. We prove the correctness of the algorithms and bound their running time in the next subsection.

Algorithm 2 Projected stochastic subgradient descent for optimizing $U$
Input: factors $V \in \mathbb{R}^{n \times r}$, training data $S$, regularization parameter $\lambda$, rank constraint $r$, number of iterations $T$
1: $U_0 \leftarrow 0^{m \times r}$
2: for all $t$ from 1 to $T$ do
3:   Choose $(u, C, i) \in S$ uniformly at random
4:   $\eta_t \leftarrow \frac{1}{\lambda t}$
5:   $C' \leftarrow \{c \in C_{-i} \mid (U_{t-1} V^\top)_{u,i} - (U_{t-1} V^\top)_{u,c} < 1\}$
6:   $U_t \leftarrow (1 - \eta_t \lambda)\, U_{t-1} + \frac{\eta_t}{|C_{-i}|} \sum_{c \in C'} \check{V}^{i,c,u}$
7:   $U_t \leftarrow \min\left(1, \frac{1/\sqrt{\lambda}}{\|U_t\|_F}\right) U_t$
8: return $U_T$

Algorithm 3 Projected stochastic subgradient descent for optimizing $V$
Input: factors $U \in \mathbb{R}^{m \times r}$, training data $S$, regularization parameter $\lambda$, rank constraint $r$, number of iterations $T$
1: $V_0 \leftarrow 0^{n \times r}$
2: for all $t$ from 1 to $T$ do
3:   Choose $(u, C, i) \in S$ uniformly at random
4:   $\eta_t \leftarrow \frac{1}{\lambda t}$
5:   $C' \leftarrow \{c \in C_{-i} \mid (U V_{t-1}^\top)_{u,i} - (U V_{t-1}^\top)_{u,c} < 1\}$
6:   $V_t \leftarrow (1 - \eta_t \lambda)\, V_{t-1} + \frac{\eta_t}{|C_{-i}|} \sum_{c \in C'} \hat{U}^{i,c,u}$
7:   $V_t \leftarrow \min\left(1, \frac{1/\sqrt{\lambda}}{\|V_t\|_F}\right) V_t$
8: return $V_T$

4.2 Analysis

The convex subproblems we analyze have the general form
$$\min_{X \in D} f(X; \ell) = \min_{X \in D}\ \frac{\lambda}{2} \|X\|_F^2 + \frac{1}{|S|} \sum_{(u,C,i) \in S} \ell\left(X; (u, C, i)\right). \tag{8}$$
One can obtain the individual subproblems by specifying the domain $D$ and the loss function $\ell$. For example, in the case of Algorithm 2, the corresponding minimization problem is specified by
$$\min_{X \in \mathbb{R}^{m \times r}} f(X; \ell_V), \tag{9}$$
where $\ell_V(X; (u, C, i)) = \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((X V^\top)_{u,i} - (X V^\top)_{u,c}\right)$, and in the case of Algorithm 3, it is specified by
$$\min_{X \in \mathbb{R}^{n \times r}} f(X; \ell_U), \tag{10}$$
where $\ell_U(X; (u, C, i)) = \frac{1}{|C_{-i}|} \sum_{c \in C_{-i}} h\left((U X^\top)_{u,i} - (U X^\top)_{u,c}\right)$. Let $U^* = \arg\min_U f(U; \ell_V)$ and $V^* = \arg\min_V f(V; \ell_U)$ denote the solution matrices of Equations 9 and 10, respectively.
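
The following Python sketch mirrors Algorithm 2 step by step. It is an added illustration based on the pseudocode above, not the authors' released implementation; inferring the number of users $m$ from $S$ is an assumption of the sketch.

```python
# Sketch of Algorithm 2 (projected stochastic subgradient descent for U).
# S is a list of (u, C, i) triples with i in C and |C| >= 2.
import numpy as np

def optimize_U(V, S, lam, r, T, seed=0):
    rng = np.random.default_rng(seed)
    m = 1 + max(u for u, _, _ in S)          # number of users (assumption)
    U = np.zeros((m, r))                     # step 1
    for t in range(1, T + 1):
        u, C, i = S[rng.integers(len(S))]    # step 3: sample an example
        eta = 1.0 / (lam * t)                # step 4: eta_t = 1/(lambda*t)
        others = [c for c in C if c != i]    # C_{-i}
        # step 5: candidates whose margin (UV^T)_{u,i} - (UV^T)_{u,c} is < 1
        viol = [c for c in others if U[u] @ (V[i] - V[c]) < 1.0]
        # step 6: each V-check matrix touches only row u, so only that row moves
        grad_row = sum((V[i] - V[c] for c in viol), np.zeros(r))
        U *= (1.0 - eta * lam)
        U[u] += (eta / len(others)) * grad_row
        # step 7: project onto the Frobenius ball of radius 1/sqrt(lambda)
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(U)
        if norm > radius:
            U *= radius / norm
    return U
```

Algorithm 3 is symmetric: swap the roles of $U$ and $V$ and apply the $\hat{U}^{i,c,u}$ update of its step 6, which touches rows $i$ and $c$ of $V$.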

Also, given a general convex loss $\ell$ and domain $D$, we call $X_\epsilon \in D$ an $\epsilon$-accurate solution for the corresponding minimization problem if $f(X_\epsilon; \ell) \le \min_{X \in D} f(X; \ell) + \epsilon$.

In the remainder of this subsection, we show that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm to the CLR setting. Then, we prove certain properties that are prerequisites for obtaining Pegasos's performance guarantees. In particular, we show that the approximate subgradients computed by Algorithms 2 and 3 are bounded and the loss functions associated with Equations 9 and 10 are convex. In the end, we plug these properties into a theorem proved by Shalev-Shwartz et al. [3] to show that our algorithms reach an $\epsilon$-accurate solution with respect to their corresponding minimization problems in $\tilde{O}\left(\frac{1}{\lambda \epsilon}\right)$ iterations.

Lemma 2. $\|U^*\|_F \le \frac{1}{\sqrt{\lambda}}$ and $\|V^*\|_F \le \frac{1}{\sqrt{\lambda}}$.

Proof. One can obtain the bounds on the norms of the optimal solutions by examining the dual form of the optimization problems and applying the strong duality theorem. Equations 9 and 10 can both be represented as
$$\min_{v \in D}\ \frac{\lambda}{2} \|v\|^2 + \sum_{d} e_d\, h\left(f_d(v)\right), \tag{11}$$
where the $e_d$ are nonnegative constants that sum to 1 (namely, $e_d = \frac{1}{|S|\,|C_{-i}|}$ for the hinge term generated by example $(u,C,i)$ and candidate $c \in C_{-i}$), $h$ is the hinge function, $D$ is a Euclidean space, and each $f_d$ is a linear function. We rewrite Equation 11 as a constrained optimization problem
$$\min_{v \in D,\ \xi}\ \frac{\lambda}{2} \|v\|^2 + \sum_{d} e_d \xi_d \tag{12}$$
subject to $\xi_d \ge 1 - f_d(v)$ and $\xi_d \ge 0$ for all $d$. The Lagrangian of this problem is
$$\begin{aligned}
\mathcal{L}(v, \xi, \alpha, \beta) &= \frac{\lambda}{2} \|v\|^2 + \sum_{d} e_d \xi_d + \sum_{d} \alpha_d \left(1 - f_d(v) - \xi_d\right) - \sum_{d} \beta_d \xi_d \\
&= \frac{\lambda}{2} \|v\|^2 + \sum_{d} \xi_d \left(e_d - \alpha_d - \beta_d\right) + \sum_{d} \alpha_d \left(1 - f_d(v)\right),
\end{aligned}$$
and its dual function is $g(\alpha, \beta) = \inf_{v, \xi} \mathcal{L}(v, \xi, \alpha, \beta)$. Since $\mathcal{L}(v, \xi, \alpha, \beta)$ is convex and differentiable with respect to $v$ and $\xi$, the necessary and sufficient conditions for minimizing over $v$ and $\xi$ are
$$\nabla_v \mathcal{L} = 0 \;\Rightarrow\; v = \frac{1}{\lambda} \sum_{d} \alpha_d \nabla_v f_d(v), \qquad \nabla_{\xi_d} \mathcal{L} = 0 \;\Rightarrow\; e_d - \alpha_d - \beta_d = 0. \tag{13}$$
We plug these conditions back into the dual function and obtain
$$g(\alpha, \beta) = \inf_{v, \xi} \mathcal{L}(v, \xi, \alpha, \beta) = \frac{\lambda}{2} \left\|\frac{1}{\lambda} \sum_{d} \alpha_d \nabla_v f_d(v)\right\|^2 + \sum_{d} \alpha_d \left(1 - f_d\!\left(\frac{1}{\lambda} \sum_{d'} \alpha_{d'} \nabla_v f_{d'}(v)\right)\right). \tag{14}$$
Since each $f_d$ is a linear function, we let $f_d(v) = \langle a_d, v \rangle$, where $a_d$ is a constant vector, and $\nabla_v f_d(v) = a_d$. Then,
$$f_d\!\left(\frac{1}{\lambda} \sum_{d'} \alpha_{d'} \nabla_v f_{d'}(v)\right) = \frac{1}{\lambda} \left\langle a_d, \sum_{d'} \alpha_{d'} a_{d'} \right\rangle. \tag{15}$$
Simplifying Equation 14 using Equation 15 yields
$$g(\alpha, \beta) = \sum_{d} \alpha_d - \frac{1}{2\lambda} \left\|\sum_{d} \alpha_d a_d\right\|^2. \tag{16}$$
Finally, we combine Equations 13 and 16, and obtain the dual form of Equation 12,
$$\max_{\alpha}\ \sum_{d} \alpha_d - \frac{1}{2\lambda} \left\|\sum_{d} \alpha_d a_d\right\|^2 \tag{17}$$
subject to $0 \le \alpha_d \le e_d$ for all $d$.

The primal problem is convex, its constraints are linear, and the domain of its objective is open; thus, Slater's condition holds and strong duality is obtained. Furthermore, the primal problem has differentiable objective and constraint functions, which implies that $(v^*, \xi^*)$ is primal optimal and $(\alpha^*, \beta^*)$ is dual optimal if and only if these points satisfy the Karush-Kuhn-Tucker (KKT) conditions. It follows that
$$v^* = \frac{1}{\lambda} \sum_{d} \alpha_d^* a_d. \tag{18}$$
Note that we defined the constants $e_d$ so that $\sum_d e_d = 1$, and the constraints of the dual problem imply $0 \le \alpha_d^* \le e_d$; thus, $\sum_d \alpha_d^* \le 1$. Because of strong duality, there is no duality gap, and the primal and dual objectives are equal at the optimum:
$$\frac{\lambda}{2} \|v^*\|^2 + \sum_{d} e_d \xi_d^* = \sum_{d} \alpha_d^* - \frac{\lambda}{2} \|v^*\|^2 \qquad \text{(by (18))}.$$
Since $\xi_d^* \ge 0$ for all $d$, this gives $\lambda \|v^*\|^2 \le \sum_d \alpha_d^* \le 1$, and therefore $\|v^*\| \le \frac{1}{\sqrt{\lambda}}$. This proves the lemma.
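
As a quick numerical illustration of Lemma 2 (an added check, not part of the original analysis; it assumes SciPy is available and uses invented names), one can minimize a small random instance of the generic problem (11) and confirm that the solution's norm stays below $1/\sqrt{\lambda}$:

```python
# Added sanity check of Lemma 2 on a tiny instance of problem (11):
#   min_v (lam/2)*||v||^2 + sum_d e_d * max(0, 1 - <a_d, v>),  sum_d e_d = 1.
# The minimizer should satisfy ||v*|| <= 1/sqrt(lam).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
lam, dim, D = 0.3, 4, 20
A = rng.normal(size=(D, dim))     # the constant vectors a_d
e = np.full(D, 1.0 / D)           # weights e_d summing to 1

def objective(v):
    return 0.5 * lam * v @ v + e @ np.maximum(0.0, 1.0 - A @ v)

res = minimize(objective, np.zeros(dim), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20000})
# The first value should not exceed the second (up to solver tolerance).
print(np.linalg.norm(res.x), 1.0 / np.sqrt(lam))
```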

Given the bounds in Lemma 2, it can be verified that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm for optimizing Equations 9 and 10, respectively. It still remains to show that Pegasos's performance guarantees hold in our case.

Lemma 3. In Algorithms 2 and 3, the approximate subgradients have norm at most $\sqrt{\lambda} + \frac{2}{\sqrt{\lambda}}$.

Proof. The approximate subgradient for Algorithm 3 is depicted in Equation 7. Due to the projection step, $\|V\|_F \le \frac{1}{\sqrt{\lambda}}$, and it follows that $\|\lambda V\|_F \le \sqrt{\lambda}$. The term $\hat{U}^{i,c,u}$ is constructed using Equation 5, and it can be verified that $\|\hat{U}^{i,c,u}\|_F \le \sqrt{2}\, \|U\|_F$. Using the triangle inequality, one can bound Equation 7 with $\sqrt{\lambda} + \sqrt{\frac{2}{\lambda}}$. A similar argument can be made for the approximate subgradient of Algorithm 2, yielding the slightly higher upper bound given in the lemma statement.

We combine the lemmas to obtain the correctness and running time guarantees for our algorithms.

Lemma 4. Let $\lambda \le \frac{1}{4}$, let $T$ be the total number of iterations of Algorithm 2, and let $U_t$ denote the parameter computed by the algorithm at iteration $t$. Let $\bar{U} = \frac{1}{T} \sum_{t=1}^{T} U_t$ denote the average of the parameters produced by the algorithm. Then, with probability at least $1 - \delta$,
$$f(\bar{U}; \ell_V) - f(U^*; \ell_V) \le O\!\left(\frac{\ln \frac{T}{\delta}}{\lambda T}\right).$$
The analogous result holds for Algorithm 3 as well.

Proof. First, for each loss function $\ell_V$ and $\ell_U$, variables are linearly combined, composed with the convex hinge function, and then averaged. All these operations preserve convexity, hence both loss functions are convex. Second, we have argued above that Algorithms 2 and 3 are adaptations of the Pegasos [3] algorithm for optimizing Equations 9 and 10, respectively. Third, in Lemma 3, we proved a bound on the approximate subgradients of both algorithms. Plugging these three results into the corresponding corollary in Shalev-Shwartz et al. [3] yields the statement of the lemma.

The theorem below gives a bound in terms of individual parameters rather than average parameters.

Theorem 2. Assume that the conditions and the bound in Lemma 4 hold. Let $t$ be an iteration index selected uniformly at random from $\{1, \dots, T\}$. Then, with probability at least $\frac{1}{2}$,
$$f(U_t; \ell_V) - f(U^*; \ell_V) \le O\!\left(\frac{\ln \frac{T}{\delta}}{\lambda T}\right).$$
The analogous result holds for Algorithm 3 as well.

Proof. The result follows directly from combining Lemma 4 with Lemma 3 in Shalev-Shwartz et al. [3].

Thus, with high probability, our algorithms reach an $\epsilon$-accurate solution in $\tilde{O}\left(\frac{1}{\lambda \epsilon}\right)$ iterations. Since we argued in Subsection 4.1 that the running time of each stochastic update is $O(br)$, it follows that a complete run of projected stochastic subgradient descent takes $\tilde{O}\left(\frac{br}{\lambda \epsilon}\right)$ time, and the running time is independent of the size of the training data.
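
A small randomized check of Lemma 3 (again an added illustration with invented names, not part of the original analysis): draw factors inside the projection ball of radius $1/\sqrt{\lambda}$, form the subgradient of Equation (7), and confirm that its Frobenius norm never exceeds $\sqrt{\lambda} + 2/\sqrt{\lambda}$.

```python
# Added check: the subgradient (7) used by Algorithm 3 stays within the
# Lemma 3 bound sqrt(lam) + 2/sqrt(lam) when U and V lie in the projection ball.
import numpy as np

rng = np.random.default_rng(2)
lam, m, n, r = 0.25, 6, 8, 3
radius = 1.0 / np.sqrt(lam)

def project(M):
    nrm = np.linalg.norm(M)
    return M if nrm <= radius else M * (radius / nrm)

worst = 0.0
for _ in range(1000):
    U, V = project(rng.normal(size=(m, r))), project(rng.normal(size=(n, r)))
    u = rng.integers(m)
    C = rng.choice(n, size=4, replace=False)   # candidate set, i = C[0]
    i, others = C[0], list(C[1:])
    grad = lam * V                              # the lambda*V term of (7)
    for c in others:                            # subtract the averaged U-hat terms
        if U[u] @ (V[i] - V[c]) < 1.0:
            grad[i] -= U[u] / len(others)
            grad[c] += U[u] / len(others)
    worst = max(worst, np.linalg.norm(grad))

# The first value should stay below the second.
print(worst, np.sqrt(lam) + 2.0 / np.sqrt(lam))
```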

References

[1] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

[2] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133-142, New York, NY, USA, 2002. ACM.

[3] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, March 2011.

[4] Nathan Srebro, Noga Alon, and Tommi Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems (NIPS), 2004.