BINARY STABLE EMBEDDING VIA PAIRED COMPARISONS. Andrew K. Massimino and Mark A. Davenport
2016 IEEE Statistical Signal Processing Workshop (SSP)

Georgia Institute of Technology, School of Electrical and Computer Engineering

ABSTRACT

Suppose that we wish to estimate a vector $x$ from a set of binary paired comparisons of the form "$x$ is closer to $p$ than to $q$" for various choices of vectors $p$ and $q$. The problem of estimating $x$ from this type of observation arises in a variety of contexts, including nonmetric multidimensional scaling, unfolding, and ranking problems, often because it provides a powerful and flexible model of preference. The main contribution of this paper is to show that under a randomized model for $p$ and $q$, a suitable number of binary paired comparisons yields a stable embedding of the space of target vectors.

Index Terms— paired comparisons, binary stable embeddings, recommendation systems, 1-bit compressive sensing.

1. INTRODUCTION

The central problem we consider in this paper is the estimation of a vector $x \in \mathbb{R}^n$ where, rather than directly observing $x$, we are restricted to $m$ observations of the form "$x$ is closer to $p_i$ than to $q_i$," where $p_i, q_i \in \mathbb{R}^n$ correspond to points whose locations are known. Each such comparison essentially divides $\mathbb{R}^n$ in half and tells us on which side of a hyperplane the point $x$ lies. Observations of this form arise in a variety of contexts, but a particularly important class of applications involves recommendation systems, targeted advertisement, and psychological studies, where $x$ represents an "ideal point" that models a particular user's preferences and the $p_i$ and $q_i$ represent items that the user compares [1]. Items which are close to $x$ are those most preferred by the user. Paired comparisons arise naturally in this context since precise numerical scores quantifying a user's preference are generally much more difficult to assign than comparative judgements [2, 3].
Moreover, data consisting of paired comparisons is often generated implicitly in contexts where the user has the option to act on two or more alternatives; for instance, they may choose to watch a particular movie, or click a particular advertisement, out of those displayed to them [4]. In such contexts, the true distances in the ideal point model's preference space are generally inaccessible in any direct way, but it is nevertheless still possible to obtain a good estimate of a user's ideal point.

(This work was supported by grants NRL N C00, AFOSR FA, NSF CCF-35066, CCF, and CMMI.)

The fundamental question which interests us in this paper is how many comparisons suffice in order to guarantee that the number of differing paired comparisons generated by $x$ and $y$ is roughly proportional to the Euclidean distance between $x$ and $y$. This allows us to understand how well $x$ can potentially be estimated from highly quantized information, and gives an idea of the stability of the problem in the presence of labeling errors. We consider the case where we are given an existing embedding of the items, as in a mature recommender system, and focus on the on-line problem of locating a single new user from feedback consisting of binary data generated from paired comparisons. The item embedding could be generated using a variety of methods, such as multidimensional scaling applied to a set of item features, or even using the results of previous paired comparisons via an approach like that in [5]. We wish to understand how many comparisons are then required to accurately localize a user's ideal point. Any precise answer to this question would depend on the underlying geometry of the item embedding, since some sets of hyperplanes will yield better tessellations of the preference space than others.
Thus, to gain some intuition on this problem without reference to the geometry of a particular embedding, we will instead consider a probabilistic model in which the items are generated at random from a particular distribution. It is important to note that the ideal point model, while similar, is distinct from the low-rank factor (or attribute) model used in the matrix completion approaches which have recently gained much attention as applied to recommendation systems, e.g., [6, 7]. Although both models suppose user choices are guided by a number of attributes, the ideal point model leads to preferences that are non-monotonic functions of those attributes. There is also empirical evidence that the ideal point model captures user behavior more accurately than factorization-based approaches do [8, 9]. There is a large body of work that studies the problem of learning to rank items from various sources of data, including paired comparisons of the sort we consider in this paper; see, for example, [10, 11, 12] and references therein. We first note that in most work on rankings, the central focus is on learning a correct rank-ordered list for a particular user, without providing any guarantees on recovering a correct parameterization for the user's preferences as we do here. Perhaps most
closely related to our work is that of [10, 11], which examines the problem of learning a rank ordering using the same ideal point model considered in this paper. The message of this work is broadly consistent with ours, in that the number of comparisons required should scale with the dimension of the preference space, not the total number of items. Also closely related is the work in [13, 14, 15], which considers paired comparisons and more general ordinal measurements in the similar, but as discussed above subtly different, context of low-rank factorizations. Finally, while seemingly unrelated, we note that our work builds on the growing body of literature on 1-bit compressive sensing. In particular, our results are largely inspired by those in [16, 17], and borrow techniques from [18] in the proofs of some of our main results.

2. THE RANDOM OBSERVATION MODEL

For the moment we will consider the noise-free setting where a user always prefers the item closest to the user's ideal point $x$. In this case we can represent the observed comparisons mathematically by letting $A_i(x)$ denote the $i$th observation, which consists of a comparison between $p_i$ and $q_i$, and setting
$$A_i(x) := \mathrm{sign}\left(\|x - q_i\|^2 - \|x - p_i\|^2\right) = \begin{cases} +1 & \text{if } x \text{ is closer to } p_i, \\ -1 & \text{if } x \text{ is closer to } q_i. \end{cases}$$
We will also use $A(x) = [A_1(x), \ldots, A_m(x)]^T$ to denote the vector of all observations resulting from $m$ comparisons. Note that if we set $a_i = p_i - q_i$ and $\tau_i = (\|p_i\|^2 - \|q_i\|^2)/2$, then we can re-write our observation model as
$$A_i(x) = \mathrm{sign}\left(2(a_i^T x - \tau_i)\right) = \mathrm{sign}\left(a_i^T x - \tau_i\right). \tag{2.1}$$
This is reminiscent of the one-bit compressive sensing setting with dithers [16, 17], with the important differences that: (i) we have not yet made any kind of sparsity or other structural assumption on $x$, and (ii) the dithers $\tau_i$, at least in this formulation, are dependent on the $a_i$, which results in difficulty applying existing results from this theory to this setting.
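As a concrete illustration of the observation model, the following sketch (our own NumPy example; the dimension, number of comparisons, and parameter values are arbitrary choices, not the paper's) verifies numerically that the distance-comparison form of $A_i(x)$ and the hyperplane form $\mathrm{sign}(a_i^T x - \tau_i)$ coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 1000
sigma = 1 / np.sqrt(n)                     # variance sigma^2 = 1/n

x = rng.standard_normal(n)
x /= 2 * np.linalg.norm(x)                 # place the ideal point inside the unit ball

p = sigma * rng.standard_normal((m, n))    # items p_i, q_i ~ N(0, sigma^2 I)
q = sigma * rng.standard_normal((m, n))

# Comparison form: A_i(x) = sign(||x - q_i||^2 - ||x - p_i||^2)
A_dist = np.sign(np.sum((x - q) ** 2, axis=1) - np.sum((x - p) ** 2, axis=1))

# Hyperplane form: a_i = p_i - q_i, tau_i = (||p_i||^2 - ||q_i||^2) / 2
a = p - q
tau = (np.sum(p ** 2, axis=1) - np.sum(q ** 2, axis=1)) / 2
A_hyp = np.sign(a @ x - tau)

assert np.array_equal(A_dist, A_hyp)       # the two forms agree for every comparison
```

Each observation is thus a one-bit measurement of $x$ against a hyperplane with normal $a_i$ and offset $\tau_i$.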
However, many of the techniques from this literature will nevertheless be helpful in analyzing this problem. To see this, we consider a randomized observation model in which the pairs $p_i, q_i$ are chosen independently with i.i.d. entries drawn according to a normal distribution, i.e., $p_i, q_i \sim \mathcal{N}(0, \sigma^2 I)$. In this case, the entries of our sensing vectors are i.i.d. with $a_{ij} \sim \mathcal{N}(0, 2\sigma^2)$. Moreover, if we define $b_i = (p_i + q_i)/2$, then we also have $b_i \sim \mathcal{N}(0, (\sigma^2/2) I)$, and we can write $\tau_i = a_i^T b_i$. Note that while $\tau_i$ is clearly dependent on $a_i$, the vectors $a_i$ and $b_i$ are independent. To further simplify things, we re-normalize by dividing by $\|a_i\|$, i.e., we set $\tilde{a}_i = a_i/\|a_i\|$ and $\tilde{\tau}_i = \tau_i/\|a_i\|$, so that
$$A_i(x) = \mathrm{sign}\left(\tilde{a}_i^T x - \tilde{\tau}_i\right).$$
It is easy to see that $\tilde{a}_i$ is distributed uniformly on the sphere $S^{n-1} = \{a \in \mathbb{R}^n : \|a\| = 1\}$. Note also that we can write $\tilde{\tau}_i = \tilde{a}_i^T b_i$. Since $a_i$ and $b_i$ are independent, $\tilde{a}_i$ and $b_i$ are also independent. Moreover, for any fixed unit vector $\tilde{a}_i$, if $b_i \sim \mathcal{N}(0, (\sigma^2/2) I)$ then $\tilde{a}_i^T b_i \sim \mathcal{N}(0, \sigma^2/2)$. Thus, we must have $\tilde{\tau}_i \sim \mathcal{N}(0, \sigma^2/2)$, independent of $\tilde{a}_i$, which is the key insight that enables the analysis below.

3. MAIN RESULT

Under the model described above, we would like to understand how much the sign patterns of two different signals can differ as a function of their Euclidean distance. Our main result is that, given enough comparisons, our measurement model provides an approximate embedding of the preference space into $\{-1, +1\}^m$. We assume any signal of interest has norm at most 1, i.e., we consider signals $x, y \in B_2^n$, the $n$-dimensional ball of radius 1. As an example application, $x$ might be the true user's ideal point and $y$ some other candidate estimate. Theorem 3.1 states that if $x$ and $y$ are sufficiently nearby, then the respective sign-measurement patterns $A(x)$ and $A(y)$ closely match. We denote by $d_H$ the Hamming distance, i.e., $d_H$ counts the number of $\{-1, +1\}$ sign measurement disagreements:
$$d_H(A(x), A(y)) := \sum_{i=1}^{m} \mathbb{1}\{A_i(x) \neq A_i(y)\}. \tag{3.1}$$
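The key fact derived above, that $\tilde{\tau}_i \sim \mathcal{N}(0, \sigma^2/2)$ independently of $\tilde{a}_i$, and the embedding behavior it enables can both be checked by simulation. The sketch below (our own illustration; the dimension, sample size, and test points are arbitrary choices) builds normalized measurements, compares the empirical standard deviation of $\tilde{\tau}_i$ with its predicted value, and prints the normalized Hamming distance for pairs of points at increasing Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 100_000
sigma = 1 / np.sqrt(n)                        # sigma^2 = 1/n, the scaling used below

p = sigma * rng.standard_normal((m, n))       # random items p_i, q_i ~ N(0, sigma^2 I)
q = sigma * rng.standard_normal((m, n))
a = p - q                                     # a_i ~ N(0, 2 sigma^2 I)
b = (p + q) / 2                               # b_i ~ N(0, (sigma^2 / 2) I), independent of a_i
norms = np.linalg.norm(a, axis=1)
tau_t = np.sum(a * b, axis=1) / norms         # tau~_i = a~_i^T b_i

print(tau_t.std(), sigma / np.sqrt(2))        # empirical vs. predicted standard deviation

def A(x):
    """Normalized observation map: A_i(x) = sign(a~_i^T x - tau~_i)."""
    return np.sign((a @ x) / norms - tau_t)

def dH(x, y):
    """Normalized Hamming distance (1/m) d_H(A(x), A(y))."""
    return float(np.mean(A(x) != A(y)))

u = rng.standard_normal(n)
u /= np.linalg.norm(u)                        # random unit direction
x = 0.4 * u                                   # a point inside the unit ball
ts = [0.05, 0.1, 0.2, 0.4]
ds = [dH(x, x - t * u) for t in ts]           # compare against points at distance t
print(ds)                                     # grows roughly in proportion to t
```

The printed distances increase with $t$, consistent with a stable embedding; the exact constants of proportionality are what the analysis below quantifies.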
Note that the following result applies uniformly to all $x$ and $y$; it follows from Lemma 3.2, which in contrast applies only to a fixed pair of points.

Theorem 3.1. Let $\eta > 0$ and $\zeta > 0$. Suppose $m$ total measurements of the form (2.1) are taken, i.e., each $\tilde{a}_i$ is drawn uniformly on the unit sphere and $\tilde{\tau} \in \mathbb{R}^m$ is a vector with entries $\tilde{\tau}_i \sim \mathcal{N}(0, \sigma^2/2)$ independent of the $\tilde{a}_i$, where $\sigma^2 = 1/n$. If
$$m \;\geq\; \frac{1}{\zeta^2}\left(n \log\frac{6}{\zeta} + \log\frac{2}{\eta}\right),$$
then there are constants $c_1, c_2, c_3, c_4$ such that with probability at least $1 - \eta$, for all points $x, y \in B_2^n$,
$$c_1\|x-y\| - c_2\zeta \;\leq\; \frac{1}{m}\, d_H(A(x), A(y)) \;\leq\; c_3\|x-y\| + \left(c_4 + \sqrt{\frac{n}{\pi}}\right)\zeta.$$

Proof. By Lemma 3.2, for any fixed pair $w, z \in B_2^n$, we have bounds on the Hamming distance that hold with probability at least $1 - 2\exp(-2\zeta^2 m)$ for all $u \in B_\delta(w)$ and $v \in B_\delta(z)$. Since the ball of radius 1 can be covered by a set $U$ of radius-$\delta$ balls with $|U| \leq (3/\delta)^n$, by a union bound we have with
probability at least $1 - 2(3/\delta)^{2n}\exp(-2\zeta^2 m)$ that for all $w, z \in U$ and all $u \in B_\delta(w)$, $v \in B_\delta(z)$,
$$c_1\|w-z\| - c_2\delta - \zeta \;\leq\; \frac{1}{m}\, d_H(A(u), A(v)) \;\leq\; c_3\|w-z\| + 2\delta\sqrt{\frac{n}{\pi}} + \zeta.$$
Since $\|x-y\| - 2\delta \leq \|w-z\| \leq \|x-y\| + 2\delta$,
$$c_1\|x-y\| - (2c_1 + c_2)\delta - \zeta \;\leq\; \frac{1}{m}\, d_H(A(x), A(y)) \;\leq\; c_3\|x-y\| + 2c_3\delta + 2\delta\sqrt{\frac{n}{\pi}} + \zeta.$$
Letting $\delta = \zeta/2$, this becomes
$$c_1\|x-y\| - \left(c_1 + \frac{c_2}{2} + 1\right)\zeta \;\leq\; \frac{1}{m}\, d_H(A(x), A(y)) \;\leq\; c_3\|x-y\| + \left(c_3 + 1 + \sqrt{\frac{n}{\pi}}\right)\zeta,$$
which is the stated bound after absorbing constants. Lower bounding the probability of success by $1 - \eta$ requires
$$2(3/\delta)^{2n}\exp(-2\zeta^2 m) \leq \eta;$$
rearranging, with $\delta = \zeta/2$, yields the assumed condition on $m$. $\square$

We comment that Theorem 3.1 concerns a particular choice of the variance parameter $\sigma^2$. A natural question is what would happen with a different choice of $\sigma^2$. In fact, this assumption is critical: intuitively, if $\sigma^2$ is too small, then nearly all of the hyperplanes induced by the comparisons will pass very close to the origin, so that accurate estimation of even $\|x\|$ becomes impossible. On the other hand, if $\sigma^2$ is too large, then an increasing number of these hyperplanes will not even intersect the ball of radius 1 in which $x$ is presumed to lie, thus yielding no new information.

Lemma 3.2. Let $w, z \in B_2^n$ be fixed, and let $m$ measurements of the form (2.1) be taken, in which each $\tilde{a}_i$ is drawn uniformly on the unit sphere and $\tilde{\tau} \in \mathbb{R}^m$ is a vector with entries $\tilde{\tau}_i \sim \mathcal{N}(0, \sigma^2/2)$ independent of the $\tilde{a}_i$. Fix $\zeta > 0$ and $\delta > 0$, and define $B_\delta(w) := \{u \in B_2^n : \|u - w\| \leq \delta\}$. Then there are constants $c_1$, $c_2$, and $c_3$ such that with probability at least $1 - 2\exp(-2\zeta^2 m)$, for all $u \in B_\delta(w)$ and $v \in B_\delta(z)$,
$$c_1\|w-z\| - c_2\delta - \zeta \;\leq\; \frac{1}{m}\, d_H(A(u), A(v)) \;\leq\; c_3\|w-z\| + 2\delta\sqrt{\frac{n}{\pi}} + \zeta,$$
where $d_H$ is the Hamming distance (3.1).

Proof. Fix $\delta > 0$ and let $u \in B_\delta(w)$, $v \in B_\delta(z)$. Recall that the Hamming distance $d_H$ is a sum of independent, identically distributed Bernoulli random variables, so we may bound it using Hoeffding's inequality. Since our probabilistic upper and lower bounds must hold for all $u, v$ as described above, we introduce quantities $L_0$ and $L_1$ which represent the two extreme cases of the Bernoulli variables:
$$L_0 := \frac{1}{m}\sum_{i=1}^{m}\; \sup_{u \in B_\delta(w),\, v \in B_\delta(z)} \mathbb{1}\{A_i(u) \neq A_i(v)\}, \qquad L_1 := \frac{1}{m}\sum_{i=1}^{m}\; \inf_{u \in B_\delta(w),\, v \in B_\delta(z)} \mathbb{1}\{A_i(u) \neq A_i(v)\}.$$
Then we have
$$L_1 \;\leq\; \frac{1}{m}\, d_H(A(u), A(v)) \;\leq\; L_0.$$
Denote $P_0 = 1 - \mathbb{E}[L_0]$ and $P_1 = \mathbb{E}[L_1]$, i.e.,
$$P_0 = \mathbb{P}\{\forall u \in B_\delta(w),\, v \in B_\delta(z) : A_i(u) = A_i(v)\},$$
$$P_1 = \mathbb{P}\{\forall u \in B_\delta(w),\, v \in B_\delta(z) : A_i(u) \neq A_i(v)\}.$$
We give a lower bound for $P_0$ in Lemma 3.3 and for $P_1$ in Lemma 3.4. Now, by Hoeffding's inequality,
$$\mathbb{P}\{L_0 > 1 - P_0 + \zeta\} \leq \exp(-2m\zeta^2), \qquad \mathbb{P}\{L_1 < P_1 - \zeta\} \leq \exp(-2m\zeta^2).$$
Hence, with probability at least $1 - 2\exp(-2m\zeta^2)$,
$$P_1 - \zeta \;\leq\; \frac{1}{m}\, d_H(A(u), A(v)) \;\leq\; 1 - P_0 + \zeta.$$
Since, by Lemma 3.4,
$$P_1 \;\geq\; \frac{\|w-z\| - 4\delta}{16\sqrt{2\pi e}},$$
and since, by Lemma 3.3 and recalling that $\sigma^2 = 1/n$,
$$1 - P_0 \;\leq\; \frac{1}{\sigma\pi}\,\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}\,\|w-z\| + \frac{2\delta}{\sigma\sqrt{\pi}} \;\leq\; \frac{\sqrt{2}}{\pi}\,\|w-z\| + 2\delta\sqrt{\frac{n}{\pi}},$$
the lemma is satisfied with appropriate choices of $c_1$, $c_2$, and $c_3$. $\square$

3.1. Probability estimates

Lemma 3.3. Let $w, z \in B_2^n$ be distinct and fix $\delta > 0$. Denote by $P_0$ the probability that all points $u$ and $v$ within $\delta$ of $w$ and $z$, respectively, do not differ in the random measurement $A_i$, i.e., that the two $\delta$-balls are not split by hyperplane $i$. The direction and threshold of hyperplane $i$ are denoted by $a$ and $\tau$, respectively. That is,
$$P_0 = \mathbb{P}\{\forall u \in B_\delta(w),\, v \in B_\delta(z) : A_i(u) = A_i(v)\}.$$
Then
$$P_0 \;\geq\; 1 - \frac{1}{\sigma\pi}\,\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}\,\|w-z\| - \frac{2\delta}{\sigma\sqrt{\pi}}.$$

Proof. This result will require the following integral:
$$\int_{S^{n-1}} |a^T(w-z)|\, \nu(da) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}\,\|w-z\|,$$
where $\nu$ is the uniform probability measure on the unit sphere. We need a lower bound on $P_0$, which is equivalent to an upper bound on
$$1 - P_0 = \mathbb{P}\{A_i(u) \neq A_i(v) \text{ for some } u \in B_\delta(w),\, v \in B_\delta(z)\}.$$
Assume $a^T w > a^T z$. Then this probability is just
$$\mathbb{P}\{a^T v < \tau < a^T u \text{ for some } u \in B_\delta(w),\, v \in B_\delta(z)\} = \mathbb{P}\left\{\min_{v \in B_\delta(z)} a^T v < \tau < \max_{u \in B_\delta(w)} a^T u\right\}.$$
But
$$\min_{v \in B_\delta(z)} a^T v \geq a^T z - \delta \qquad \text{and} \qquad \max_{u \in B_\delta(w)} a^T u \leq a^T w + \delta,$$
so we have
$$1 - P_0 \;\leq\; \mathbb{P}\{a^T z - \delta < \tau < a^T w + \delta\} = \int_{S^{n-1}} \left[\Phi\!\left(\frac{a^T w + \delta}{\sigma/\sqrt{2}}\right) - \Phi\!\left(\frac{a^T z - \delta}{\sigma/\sqrt{2}}\right)\right] \nu(da)$$
$$\leq\; \frac{1}{\sigma\sqrt{\pi}} \int_{S^{n-1}} \left(|a^T(w-z)| + 2\delta\right) \nu(da) \;\leq\; \frac{1}{\sigma\pi}\,\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}\,\|w-z\| + \frac{2\delta}{\sigma\sqrt{\pi}}. \qquad \square$$

Lemma 3.4 (Probability of separation with a single random measurement). Let $w, z \in B_2^n$ be distinct and fix $\delta > 0$. Denote by $P_1$ the probability that all points $u$ and $v$ within $\delta$ of $w$ and $z$, respectively, differ in the random measurement $A_i$, i.e., that the two $\delta$-balls are separated by hyperplane $i$. The direction and threshold of hyperplane $i$ are denoted by $a$ and $\tau$, respectively. That is,
$$P_1 = \mathbb{P}\{\forall u \in B_\delta(w),\, v \in B_\delta(z) : A_i(u) \neq A_i(v)\}.$$
Set $\epsilon_0 = \|w - z\|$. Then
$$P_1 \;\geq\; \frac{\epsilon_0 - 4\delta}{16\sqrt{2\pi e}}.$$

Proof. Assume $\|w\| \geq \|z\|$; otherwise, swap $w$ and $z$ in the argument that follows. Denote $\Delta = w - z$ and $\epsilon = \|\Delta\|$. If $B_\delta(w) \cap B_\delta(z) \neq \emptyset$, then the chance that a hyperplane with orientation $a$ and offset $\tau$ splits the two balls is zero. Hence, we may restrict attention to the portion of the sphere where $a^T(w-z) \geq 2\delta$. Further, since the distribution of $\tau$ is symmetric, we can restrict to $C_\alpha = \{a : a^T(w-z) \geq \alpha\epsilon\}$ for some $\alpha\epsilon \geq 2\delta$ and double the integral; $C_\alpha$ is a hyper-spherical cap of height $\alpha\epsilon$. Then
$$P_1 = \mathbb{P}\{a^T z + \delta \leq \tau \leq a^T w - \delta\} + \mathbb{P}\{a^T w + \delta \leq \tau \leq a^T z - \delta\} \;\geq\; 2\int_{C_\alpha} \left[\Phi\!\left(\frac{a^T w - \delta}{\sigma/\sqrt{2}}\right) - \Phi\!\left(\frac{a^T z + \delta}{\sigma/\sqrt{2}}\right)\right] \nu(da).$$
To obtain a lower bound, we consider the area of a subset $\tilde{C}_\alpha \subseteq C_\alpha$ and multiply this by the minimum value of the integrand over that set. Let $W = \{a : a^T w \geq \xi\|w\| - \delta\}$ and $Z = \{a : a^T z \geq \xi\|z\| - \delta\}$.
Note that for any $a \in Z^c$ we have $a^T z + \delta \leq \xi\|z\| \leq \xi$, and for any $a \in W^c$ we have $a^T w - \delta \leq \xi\|w\| \leq \xi$, since $w, z \in B_2^n$. The sets $W^c$ and $Z^c$ are each the complement of a spherical cap and are of equal area. The set we consider is $\tilde{C}_\alpha = C_\alpha \cap W^c \cap Z^c$. One can show, by properties of the normal distribution, that
$$\min_{a \in \tilde{C}_\alpha} \left[\Phi\!\left(\frac{a^T w - \delta}{\sigma/\sqrt{2}}\right) - \Phi\!\left(\frac{a^T z + \delta}{\sigma/\sqrt{2}}\right)\right] \;\geq\; \frac{\alpha\epsilon - 2\delta}{\sigma/\sqrt{2}}\,\phi\!\left(\frac{\sqrt{2}\,\xi}{\sigma}\right),$$
since $a^T(w-z) \geq \alpha\epsilon$ on $C_\alpha$ and both arguments of $\Phi$ are bounded in magnitude by $\sqrt{2}\,\xi/\sigma$ on $\tilde{C}_\alpha$. The set $\tilde{C}_\alpha$ has maximal area when $W = Z$, i.e., when $z$ is a scalar multiple of $w$; however, this argument provides a lower bound for arbitrary $w$ and $z$. Setting $\alpha$ and $\xi$ effects a trade-off between the area of the spherical section $\tilde{C}_\alpha$ and the Gaussian density; recall that by assumption $\sigma^2 = 1/n$. By integrating, it can be shown that the normalized area $\nu(\tilde{C}_\alpha)$ is bounded below by a constant depending only on the choice of $\xi$ and $\alpha$. Combining the two previous bounds and choosing $\xi$ and $\alpha$ appropriately yields the stated result,
$$P_1 \;\geq\; \frac{\epsilon_0 - 4\delta}{16\sqrt{2\pi e}}. \qquad \square$$
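The separation probabilities $P_0$ and $P_1$ bounded in Lemmas 3.3 and 3.4 are easy to estimate by Monte Carlo simulation. The sketch below (our own check; the configuration of $w$, $z$, and $\delta$ is an arbitrary choice, and it validates the geometry of the two events rather than the constants in the lemmas) draws random hyperplanes and classifies each as leaving both $\delta$-balls on one side, separating them, or grazing one of them:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, delta = 10, 200_000, 0.05
sigma = 1 / np.sqrt(n)                         # sigma^2 = 1/n

w = np.zeros(n); w[0] = 0.5                    # two fixed points with ||w - z|| = 1
z = np.zeros(n); z[0] = -0.5

a = rng.standard_normal((m, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)  # directions uniform on the sphere
tau = (sigma / np.sqrt(2)) * rng.standard_normal(m)   # thresholds ~ N(0, sigma^2 / 2)

mw, mz = a @ w - tau, a @ z - tau              # signed margins of the two centers

# A delta-ball lies strictly on one side of the hyperplane iff |margin| > delta.
# P0: both balls on the same side; P1: the hyperplane strictly separates them.
P0 = np.mean((np.sign(mw) == np.sign(mz)) & (np.abs(mw) > delta) & (np.abs(mz) > delta))
P1 = np.mean((np.sign(mw) != np.sign(mz)) & (np.abs(mw) > delta) & (np.abs(mz) > delta))

print(P0, P1)    # P0 + P1 < 1; the gap is the chance of grazing one of the balls
```

A ball of radius $\delta$ lies strictly on one side of the hyperplane $\{x : \tilde{a}^T x = \tilde{\tau}\}$ exactly when the signed margin of its center exceeds $\delta$ in absolute value, which is the same criterion used in the proofs above.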
4. REFERENCES

[1] C. Coombs, "Psychological scaling without a unit of measurement," Psych. Rev., vol. 57, no. 3, 1950.
[2] G. Miller, "The magical number seven, plus or minus two: Some limits on our capacity for processing information," Psychological Review, vol. 63, no. 2, pp. 81–97, 1956.
[3] H. David, The Method of Paired Comparisons. London, UK: Charles Griffin & Company, 1963.
[4] F. Radlinski and T. Joachims, "Active exploration for learning rankings from clickthrough data," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), San Jose, CA, Aug. 2007.
[5] S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie, "Generalized non-metric multidimensional scaling," in Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, Mar. 2007.
[6] J. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in Proc. Int. Conf. Machine Learning (ICML), Bonn, Germany, Aug. 2005.
[7] E. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
[8] B. Dubois, "Ideal point versus attribute models of brand preference: A comparison of predictive validity," Adv. Consumer Research, 1975.
[9] A. Maydeu-Olivares and U. Böckenholt, "Modeling preference data," in The SAGE Handbook of Quantitative Methods in Psychology, 2009.
[10] K. Jamieson and R. Nowak, "Active ranking using pairwise comparisons," in Proc. Adv. in Neural Information Processing Systems (NIPS), Granada, Spain, Dec. 2011.
[11] K. Jamieson and R. Nowak, "Low-dimensional embedding using adaptively selected ordinal data," in Proc. Allerton Conf. on Communication, Control, and Computing, Monticello, IL, 2011.
[12] F. Wauthier, M. Jordan, and N. Jojic, "Efficient ranking from pairwise comparisons," in Proc. Int. Conf. Machine Learning (ICML), Atlanta, GA, June 2013.
[13] Y. Lu and S. Negahban, "Individualized rank aggregation using nuclear norm regularization," arXiv preprint, 2014.
[14] D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon, "Preference completion: Large-scale collaborative ranking from pairwise comparisons," in Proc. Int. Conf. Machine Learning (ICML), Lille, France, July 2015.
[15] S. Oh, K. Thekumparampil, and J. Xu, "Collaboratively learning preferences from ordinal data," in Proc. Adv. in Neural Information Processing Systems (NIPS), Montréal, Québec, Dec. 2014.
[16] K. Knudson, R. Saab, and R. Ward, "One-bit compressive sensing with norm estimation," IEEE Trans. Inform. Theory, vol. 62, no. 5, 2016.
[17] R. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters, "Exponential decay of reconstruction error from binary measurements of sparse signals," arXiv preprint, 2014.
[18] L. Jacques, J. Laska, P. Boufounos, and R. Baraniuk, "Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors," IEEE Trans. Inform. Theory, vol. 59, no. 4, pp. 2082–2102, 2013.
Sample Complexity of Learning Mahalanobis Distance Metrics Nakul Verma Janelia, HHMI feature 2 Mahalanobis Metric Learning Comparing observations in feature space: x 1 [sq. Euclidean dist] x 2 (all features
More informationSparsest Solutions of Underdetermined Linear Systems via l q -minimization for 0 < q 1
Sparsest Solutions of Underdetermined Linear Systems via l q -minimization for 0 < q 1 Simon Foucart Department of Mathematics Vanderbilt University Nashville, TN 3740 Ming-Jun Lai Department of Mathematics
More informationLattices and Lattice Codes
Lattices and Lattice Codes Trivandrum School on Communication, Coding & Networking January 27 30, 2017 Lakshmi Prasad Natarajan Dept. of Electrical Engineering Indian Institute of Technology Hyderabad
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationLearning Bound for Parameter Transfer Learning
Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter
More informationLecture 13 October 6, Covering Numbers and Maurey s Empirical Method
CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 13 October 6, 2016 Scribe: Kiyeon Jeon and Loc Hoang 1 Overview In the last lecture we covered the lower bound for p th moment (p > 2) and
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationMATRIX RECOVERY FROM QUANTIZED AND CORRUPTED MEASUREMENTS
MATRIX RECOVERY FROM QUANTIZED AND CORRUPTED MEASUREMENTS Andrew S. Lan 1, Christoph Studer 2, and Richard G. Baraniuk 1 1 Rice University; e-mail: {sl29, richb}@rice.edu 2 Cornell University; e-mail:
More informationImproved Group Testing Rates with Constant Column Weight Designs
Improved Group esting Rates with Constant Column Weight Designs Matthew Aldridge Department of Mathematical Sciences University of Bath Bath, BA2 7AY, UK Email: maldridge@bathacuk Oliver Johnson School
More informationRandom projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016
Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use
More informationSignal Recovery from Permuted Observations
EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,
More informationA Large Deviation Bound for the Area Under the ROC Curve
A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu
More informationNetBox: A Probabilistic Method for Analyzing Market Basket Data
NetBox: A Probabilistic Method for Analyzing Market Basket Data José Miguel Hernández-Lobato joint work with Zoubin Gharhamani Department of Engineering, Cambridge University October 22, 2012 J. M. Hernández-Lobato
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationarxiv: v5 [math.na] 16 Nov 2017
RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationConvex Phase Retrieval without Lifting via PhaseMax
Convex Phase etrieval without Lifting via PhaseMax Tom Goldstein * 1 Christoph Studer * 2 Abstract Semidefinite relaxation methods transform a variety of non-convex optimization problems into convex problems,
More informationImproved Bounds on the Dot Product under Random Projection and Random Sign Projection
Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationA Generalized Restricted Isometry Property
1 A Generalized Restricted Isometry Property Jarvis Haupt and Robert Nowak Department of Electrical and Computer Engineering, University of Wisconsin Madison University of Wisconsin Technical Report ECE-07-1
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationGREEDY SIGNAL RECOVERY REVIEW
GREEDY SIGNAL RECOVERY REVIEW DEANNA NEEDELL, JOEL A. TROPP, ROMAN VERSHYNIN Abstract. The two major approaches to sparse recovery are L 1-minimization and greedy methods. Recently, Needell and Vershynin
More informationRank Determination for Low-Rank Data Completion
Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,
More informationOn the Observability of Linear Systems from Random, Compressive Measurements
On the Observability of Linear Systems from Random, Compressive Measurements Michael B Wakin, Borhan M Sanandaji, and Tyrone L Vincent Abstract Recovering or estimating the initial state of a highdimensional
More informationSample width for multi-category classifiers
R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University
More informationLOW ALGEBRAIC DIMENSION MATRIX COMPLETION
LOW ALGEBRAIC DIMENSION MATRIX COMPLETION Daniel Pimentel-Alarcón Department of Computer Science Georgia State University Atlanta, GA, USA Gregory Ongie and Laura Balzano Department of Electrical Engineering
More informationBreaking the Limits of Subspace Inference
Breaking the Limits of Subspace Inference Claudia R. Solís-Lemus, Daniel L. Pimentel-Alarcón Emory University, Georgia State University Abstract Inferring low-dimensional subspaces that describe high-dimensional,
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationStrengthened Sobolev inequalities for a random subspace of functions
Strengthened Sobolev inequalities for a random subspace of functions Rachel Ward University of Texas at Austin April 2013 2 Discrete Sobolev inequalities Proposition (Sobolev inequality for discrete images)
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationSuccinct Data Structures for Approximating Convex Functions with Applications
Succinct Data Structures for Approximating Convex Functions with Applications Prosenjit Bose, 1 Luc Devroye and Pat Morin 1 1 School of Computer Science, Carleton University, Ottawa, Canada, K1S 5B6, {jit,morin}@cs.carleton.ca
More information1 Differential Privacy and Statistical Query Learning
10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 5: December 07, 015 1 Differential Privacy and Statistical Query Learning 1.1 Differential Privacy Suppose
More informationConstructing Explicit RIP Matrices and the Square-Root Bottleneck
Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry
More informationLearning Kernels -Tutorial Part III: Theoretical Guarantees.
Learning Kernels -Tutorial Part III: Theoretical Guarantees. Corinna Cortes Google Research corinna@google.com Mehryar Mohri Courant Institute & Google Research mohri@cims.nyu.edu Afshin Rostami UC Berkeley
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Week #1
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Week #1 Today Introduction to machine learning The course (syllabus) Math review (probability + linear algebra) The future
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationSparse and low-rank decomposition for big data systems via smoothed Riemannian optimization
Sparse and low-rank decomposition for big data systems via smoothed Riemannian optimization Yuanming Shi ShanghaiTech University, Shanghai, China shiym@shanghaitech.edu.cn Bamdev Mishra Amazon Development
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationMassachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,
More information