Learning Preferences with Millions of Parameters by Enforcing Sparsity

Xi Chen (1)(2), Bing Bai (1), Yanjun Qi (1), Qihang Lin (2), Jaime Carbonell (2)
(1) NEC Labs America, Princeton, NJ
(2) Carnegie Mellon University, Pittsburgh, PA

Abstract

We study the retrieval task that ranks a set of objects for a given query in the pairwise preference learning framework. Recently, researchers have found that raw features (e.g. words for text retrieval) and their pairwise features, which describe relationships between two raw features (e.g. word synonymy or polysemy), can greatly improve retrieval precision. However, most existing methods cannot scale up to problems with many raw features (e.g. the English vocabulary), due to the prohibitive computational cost of learning and the memory required to store a quadratic number of parameters. In this paper, we propose to learn a sparse representation of the pairwise features under the preference learning framework using l1 regularization. Based on stochastic gradient descent, an online algorithm is devised to enforce the sparsity using a mini-batch shrinkage strategy. On multiple benchmark datasets, we show that our method achieves better performance with fast convergence, and takes much less memory on models with millions of parameters.

Keywords: preference learning; sparse model; online learning; learning to rank; text mining

I. INTRODUCTION

Learning preferences among a set of objects (e.g. documents) given another object as query is a central task of information retrieval and text mining. One of the most natural frameworks for this task is pairwise preference learning, which expresses that one document is preferred over another given the query [1]. Most existing methods [2] learn the preference or relevance function by assigning a real-valued score to a feature vector describing a (query, object) pair. This feature vector normally includes a small number of hand-crafted features, such as the BM25 scores for the title or the whole text, instead of the very natural raw features [3]. A drawback of hand-crafted features is that they are often expensive and dataset-specific, requiring domain knowledge in preprocessing. In contrast, raw features are easily available and carry strong semantic information (such as word features in text mining).

In this paper we study a basic model presented in [4], [5] which uses the raw word features under the supervised pairwise preference learning framework and considers feature relationships in the model.¹ To be specific, let D be the dictionary size, i.e. the size of the query and document feature set.² Given a query q ∈ R^D and a document d ∈ R^D, the relevance score between q and d is modeled as

    f(q, d) = q^T W d = Σ_{i,j} W_{ij} Φ(q_i, d_j),    (1)

where Φ(q_i, d_j) = q_i · d_j and W_{ij} models the relationship/correlation between the i-th query feature q_i and the j-th document feature d_j. This is essentially a linear model with pairwise features Φ(·,·), and the parameter matrix W ∈ R^{D×D} is learned from labeled data. Compared to most existing models, the capacity of this model is very large because its D² free parameters can carefully model the relationship between each pair of words. From a semantic point of view, a notable strength of this model is that it can capture synonymy and polysemy, since it looks at all possible cross terms, and it can be tuned directly for the task of interest.
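As a concrete illustration of Eq. (1), the following sketch scores a single (query, document) pair under a sparse parameter matrix. It is purely illustrative, not code from the paper: the dictionary size, the 0.5% density and the random vectors are made-up values, and NumPy/SciPy are used only as convenient stand-ins.

```python
# Illustrative sketch of the bilinear score f(q, d) = q^T W d with a sparse W.
import numpy as np
from scipy.sparse import random as sparse_random

D = 1_000                                    # toy dictionary size
rng = np.random.default_rng(0)

W = sparse_random(D, D, density=0.005, random_state=0, format="csr")  # sparse parameters
q = rng.random(D); q /= np.linalg.norm(q)    # tf-idf-style query vector, unit length
d = rng.random(D); d /= np.linalg.norm(d)    # document vector, unit length

score = q @ (W @ d)                          # f(q, d), a single relevance score
print(score)
```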
Although it is very powerful, the basic model in Eq. (1) suffers from the following weaknesses, which hinder its wide application:

1) Memory issue: Given a large dictionary size D, a huge amount of memory is required to store the W matrix, whose size is quadratic in D. When D = 10,000, storing W needs nearly 1GB of RAM (10^8 double-precision entries at 8 bytes each, i.e. 0.8GB); when D = 30,000, it requires 8GB of RAM.

2) Generalization ability: With D² free parameters (the entries of W) and a limited number of training samples, the model can easily overfit. For a dictionary of size D = 10,000, there are D² = 10^8 free parameters to estimate, which is far too many for small corpora.

To address the above weaknesses, we propose to constrain W to be a sparse matrix, with zero entries for the pairs of words that are irrelevant to the preference learning task. If W is a highly sparse matrix, it consumes much less memory and has only a limited number of free parameters to be estimated.

¹ For the sake of clarity, we present the model in a text retrieval scenario. We use the term "words" for features, and "query" and "documents" for data instances, depending on their roles.
² In our model and algorithm, there is no need to restrict the query q and the document d to have the same feature size D; we make this assumption only for simplicity of exposition.
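A quick back-of-the-envelope check of the memory argument above (our own illustration; the 1% density and the CSR bookkeeping are example assumptions, not figures from the paper):

```python
# Rough memory estimate for a dense W versus a sparse W stored in CSR format.
D = 10_000
dense_bytes = D * D * 8                         # 10^8 double-precision entries
print(dense_bytes / 1e9, "GB for dense W")      # ~0.8 GB

nnz = int(0.01 * D * D)                         # e.g. 1% nonzero entries
sparse_bytes = nnz * (8 + 4) + (D + 1) * 4      # values + column indices + row pointers
print(sparse_bytes / 1e9, "GB for 1%-dense W")  # ~0.012 GB
```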

In other words, a sparse W matrix allows us to greatly scale up the dictionary size to model those non-frequent words which are often essential for the preference ordering. In addition, we obtain a faster and more scalable learning algorithm, since most entries of W are zero and the corresponding multiplications can be avoided. Another advantage of learning a sparse representation of W is its good interpretability: a zero W_{ij} indicates that the i-th query feature and the j-th document feature are not correlated for the specific task. A sparse W matrix accurately captures the correlation between pairs of words, and the learned correlated word pairs can explain the semantic rationale behind the preference ordering.

In order to enforce sparsity on W, inspired by the success of the sparse linear regression model lasso [6], we impose an entry-wise l1 regularization on W. It has been shown theoretically that, under certain conditions, l1 regularization can correctly uncover the underlying ground-truth sparsity pattern [7]. A practical challenge in using the l1 regularized model is to develop an efficient and scalable learning algorithm. Since in many preference learning applications (e.g. search engines) the model needs to be trained in a timely fashion and new (query, document) pairs may be added to the repository at any time, stochastic gradient descent in the online learning framework is the most desirable learning method for our task [8]. To enforce the l1 regularization, based on [9], we propose to perform a mini-batch shrinkage step every T iterations of stochastic gradient descent, which leads to a sparse solution. Moreover, to reduce the additional bias introduced by the l1 regularization, we further propose a refitting step which improves the preference prediction while keeping the learned sparsity pattern.

The key idea of this paper is that learning on a large number of corpus-independent raw features, or even on combinations of such features, is not impossible. Although the number of involved parameters is intimidatingly large, using sparsity as a powerful tool, the model can be well controlled and efficiently learned. We believe these ideas form a fresh perspective and will benefit many other retrieval-related tasks as well.

II. BASIC MODEL

Let us denote the set of documents in the corpus as {d_k}_{k=1}^{K} ⊂ R^D and the query as q ∈ R^D, where D is the dictionary size and the j-th dimension of a document/query vector indicates the frequency of occurrence of the j-th word, e.g. using tf-idf weighting and then normalizing to unit length [10]. Given a query q and a document d, we wish to learn a scoring function f(q, d) that measures the relevance of d to q. In this paper, we assume that f is a linear function which takes the form of Eq. (1); each entry of W represents a relationship between a pair of words.

A. Margin Rank Loss

Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a preferred document d+ and an unpreferred (or lower ranked) document d−. We would like to learn a W such that q^T W d+ > q^T W d−, making the right preference prediction. For that purpose, given a tuple (q, d+, d−), we employ the widely adopted margin rank loss [11]:

    L_W(q, d+, d−) := h(q^T W d+ − q^T W d−) = max(0, 1 − q^T W d+ + q^T W d−),    (2)

where h(x) := max(0, 1 − x) is the well-known hinge loss function adopted in SVMs.
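A minimal sketch of the margin rank loss in Eq. (2) for a single tuple (q, d+, d−); the function name and array conventions are ours, not the authors':

```python
# Sketch of the margin rank loss of Eq. (2) for one tuple (q, d+, d-).
import numpy as np

def margin_rank_loss(W, q, d_pos, d_neg):
    """L_W(q, d+, d-) = max(0, 1 - q^T W d+ + q^T W d-)."""
    margin = q @ W @ d_pos - q @ W @ d_neg
    return max(0.0, 1.0 - margin)
```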
Our goal is to learn the W matrix which minimizes the loss in Eq. (2) summed over all tuples (q, d+, d−) in R:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_W(q, d+, d−).    (3)

B. Stochastic Subgradient Descent

In general, the size of R is very large and new tuples may be added to R in a streaming manner, which makes it difficult to directly optimize the objective in Eq. (3). To overcome this challenge, we adopt stochastic (sub)gradient descent (SGD) in an online learning framework [12]. Specifically, at each iteration we randomly draw a sample (q, d+, d−) from R, compute the subgradient³ of L_W(q, d+, d−) with respect to W as

    ∂L_W(q, d+, d−) = −q(d+ − d−)^T   if q^T W (d+ − d−) < 1,
                       0                otherwise,    (4)

and then update the W matrix accordingly. It has been shown that SGD achieves fast learning on large-scale datasets [8].

We suggest initializing W to the identity matrix, as this initializes the model to the same solution as a cosine similarity score. This strategy introduces a prior expressing that the weight matrix should be close to the identity matrix: we consider term correlations only when it is necessary to increase the score of a relevant document or, conversely, to decrease the score of an irrelevant document.

As for the learning rate η_t, we suggest a decaying learning rate η_t = C/√t, where C is a pre-defined constant serving as the initial learning rate. Intuitively, this should be better than a fixed learning rate: at the beginning, when W is far from the optimal solution W*, a larger learning rate is desirable since it leads to a significant decrease of the objective value; when W gets close to W*, a smaller rate should be adopted to avoid missing the optimal solution. We show the advantage of the decaying learning rate scheme over the fixed one in the experiment section.

³ Since L is a non-smooth function, it does not have a gradient but only a subgradient.
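The following sketch puts the pieces of this subsection together: the subgradient of Eq. (4), identity initialization, and a decaying learning rate. It is illustrative only; the tuple sampler `sample_tuple` (returning a random (q, d+, d−) from R), the iteration count and the value of C are hypothetical stand-ins.

```python
# Sketch of the plain SGD scheme of Section II-B (no sparsity yet).
import numpy as np

def subgradient(W, q, d_pos, d_neg):
    """Eq. (4): -q (d+ - d-)^T when the margin is violated, 0 otherwise."""
    if q @ W @ (d_pos - d_neg) < 1.0:
        return -np.outer(q, d_pos - d_neg)
    return np.zeros_like(W)

def sgd_train(sample_tuple, D, n_iters=100_000, C=2.0):
    W = np.eye(D)                          # start from the cosine-similarity solution
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)             # decaying learning rate
        W -= eta_t * subgradient(W, q, d_pos, d_neg)
    return W
```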

III. PREFERENCE LEARNING WITH SPARSITY

As discussed in the introduction, the model and the learning algorithm of the previous section lead to a dense W matrix, which consumes a large amount of memory and generalizes poorly on small corpora. To address these problems, we learn a sparse model with a small number of nonzero entries in W. In order to obtain a sparse W, inspired by [6], we add an entry-wise l1 norm of W as a regularization term to the loss function and propose to optimize the following objective:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_W(q, d+, d−) + λ ||W||_1,    (5)

where ||W||_1 = Σ_{i,j=1}^{D} |W_{ij}| is the entry-wise l1 norm of W and λ is the regularization parameter which controls the sparsity level (the number of nonzero entries) of W. In general, a larger λ leads to a sparser W. On the other hand, a W that is too sparse will miss useful relationships among word pairs (consider a diagonal W as an extreme case). Therefore, in practice, we need to tune λ to obtain a W with a suitable sparsity level.

A. Training the Sparse Model

To optimize Eq. (5), we adopt a variant of the general sparse online learning scheme in [9]. In [9], after updating W^t at each iteration of SGD, a shrinkage step is performed by solving the following optimization problem:

    Ŵ^t = argmin_W (1/2) ||W − W^t||_F^2 + λ η_t ||W||_1,    (6)

and Ŵ^t is then used as the starting point for the next iteration. In Eq. (6), ||·||_F denotes the matrix Frobenius norm and η_t is the decaying learning rate for the t-th iteration. According to the next proposition, performing (6) shrinks those W^t_{ij} with absolute value less than λη_t to zero and hence leads to a sparse W matrix.

Proposition 1. The solution Ŵ^t of the optimization problem in Eq. (6) takes the following form:

    Ŵ^t_{ij} = W^t_{ij} + λη_t   if W^t_{ij} < −λη_t,
             = 0                  if −λη_t ≤ W^t_{ij} ≤ λη_t,    (7)
             = W^t_{ij} − λη_t   if W^t_{ij} > λη_t.

Although the shrinkage step leads to a sparse W, it is very expensive for a large dictionary size D. For example, when D = 10,000, it requires D² = 10^8 operations. Therefore, we suggest performing the shrinkage step only every T iterations. In general, a smaller T guarantees that the shrinkage step is done in a timely fashion, so that the entries of W do not grow too large and produce an inaccurate L_W(q, d+, d−); on the other hand, a smaller T increases the computational cost of training. In practice, T is chosen to balance these two considerations. The details are presented in Algorithm 1.

Note that when t is a multiple of T, we perform the shrinkage step with a cumulative regularization parameter on ||W||_1 accumulated from iteration t − T + 1 to t: λ Σ_{k=t−T+1}^{t} η_k. The reason for adopting a cumulative regularization parameter is the following simple fact: the W^T obtained by solving the sequence of successive optimization problems

    W^t = argmin_W (1/2) ||W − W^{t−1}||_F^2 + λ η_t ||W||_1,   t = 1, ..., T,

is identical to the one obtained by solving the single optimization problem

    W^T = argmin_W (1/2) ||W − W^0||_F^2 + λ (Σ_{t=1}^{T} η_t) ||W||_1.

This equivalence can easily be proved using Proposition 1, as shown in [9]. Although gradient updates are taken in between, so that Algorithm 1 is not exactly identical to performing the shrinkage step at every iteration, the sparse W learned by Algorithm 1 is empirically a good approximation.

Algorithm 1
Initialization: W^0 ∈ R^{D×D}, T, learning rate sequence {η_t}.
Iterate for t = 1, 2, ... until convergence of W^t:
  1) Randomly draw a tuple (q, d+, d−) ∈ R.
  2) Compute the subgradient of L_{W^{t−1}}(q, d+, d−) with respect to W: ∂L_{W^{t−1}}(q, d+, d−).
  3) Update W^t = W^{t−1} − η_t ∂L_{W^{t−1}}(q, d+, d−).
  4) If (t mod T = 0):
       W^t ← argmin_W (1/2) ||W − W^t||_F^2 + λ (Σ_{k=t−T+1}^{t} η_k) ||W||_1.
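A compact sketch of Algorithm 1 follows, showing how the soft-thresholding of Proposition 1 is applied every T iterations with the cumulative regularization parameter. It is illustrative only; the tuple sampler and the default hyperparameter values are ours, not the paper's.

```python
# Sketch of Algorithm 1: SGD updates plus a shrinkage (soft-thresholding) step
# every T iterations, using the cumulative parameter lambda * sum of eta_k.
import numpy as np

def soft_threshold(W, tau):
    """Closed-form solution of Eq. (6)/(7): shrink every entry of W toward zero by tau."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def sparse_sgd_train(sample_tuple, D, n_iters=100_000, C=2.0, lam=1e-4, T=100):
    W = np.eye(D)
    acc_eta = 0.0                                   # learning rates since last shrinkage
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)
        acc_eta += eta_t
        if q @ W @ (d_pos - d_neg) < 1.0:           # subgradient step, Eq. (4)
            W += eta_t * np.outer(q, d_pos - d_neg)
        if t % T == 0:                              # mini-batch shrinkage step
            W = soft_threshold(W, lam * acc_eta)
            acc_eta = 0.0
    return W
```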
B. Refitting the Sparse Model

From Eq. (7), we see that the l1 regularization not only shrinks the weights of uncorrelated word pairs to zero but also reduces the absolute values of the weights for correlated word pairs. This additional bias introduced by the l1 regularization often harms prediction performance. To reduce this bias, we propose to refit the model without l1 regularization while enforcing the sparsity pattern of W obtained from Algorithm 1.

More precisely, after learning the sparse Ŵ from Algorithm 1, let Ω be the set of indices of the nonzero entries of Ŵ, i.e. Ω = {(i, j) | Ŵ_{ij} ≠ 0}. Given a matrix W ∈ R^{D×D}, let P_Ω(W) ∈ R^{D×D} be the matrix defined as

    P_Ω(W)_{ij} = W_{ij}   if (i, j) ∈ Ω,
                = 0        if (i, j) ∉ Ω,

where P_Ω is called the projection operator which projects W onto Ω.

In the refitting step, given Ω, we minimize the following objective function:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_{P_Ω(W)}(q, d+, d−).    (8)

We still adopt SGD to minimize Eq. (8), but replace L_W(q, d+, d−) with L_{P_Ω(W)}(q, d+, d−). Using the chain rule for subgradients, the subgradient of L_{P_Ω(W)}(q, d+, d−) takes the following form:

    ∂L_{P_Ω(W)}(q, d+, d−) = −P_Ω(q(d+ − d−)^T)   if q^T P_Ω(W)(d+ − d−) < 1,
                               0                    otherwise.

In the experiment section, we show that prediction performance improves after the refitting step.
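A sketch of the refitting step, reusing the hypothetical tuple sampler from the earlier sketches: the support Ω of the learned sparse matrix is frozen and the subgradient is projected onto it. Again this is an illustration under our own naming conventions, not the authors' code.

```python
# Sketch of the refitting step of Section III-B: keep the sparsity pattern
# Omega of the learned sparse W fixed and rerun SGD with projected updates.
import numpy as np

def refit(W_sparse, sample_tuple, n_iters=100_000, C=2.0):
    mask = (W_sparse != 0)                           # Omega: support of the sparse solution
    W = W_sparse.copy()
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)
        if q @ W @ (d_pos - d_neg) < 1.0:            # W stays supported on Omega,
            grad = -np.outer(q, d_pos - d_neg)       # so q^T W d equals q^T P_Omega(W) d
            W -= eta_t * np.where(mask, grad, 0.0)   # P_Omega applied to the subgradient
    return W
```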

IV. EXPERIMENT

A. Experiment Setup

Pairwise preference learning as discussed in this paper belongs to the more general framework of learning to rank, which is a key topic in information retrieval research [2]. Learning to rank methods are usually evaluated on standard benchmark data such as the TREC data or LETOR [3]. However, TREC has only a limited number of queries available, which makes training a large number of features very difficult. On the other hand, LETOR and most other learning-to-rank datasets (e.g. the Microsoft Learning to Rank datasets and the Yahoo! Learning to Rank Challenge datasets) contain only a few hundred (or even fewer) features, such as BM25 or PageRank scores, instead of the actual word features, and are therefore not adequate for evaluating our method. It would be ideal to test the proposed method on click-through data from web search logs, but such data are not publicly available.

As pointed out by a seminal paper [13], preference learning and multiclass classification can be modeled in a unified framework. It is therefore natural to adopt multiclass classification datasets (with many different classes) to evaluate our proposed preference learning models. More precisely, we construct training samples (q, d+, d−) ∈ R where q and d+ are in the same class while q and d− belong to different classes.

In our experiments, we use several benchmark multiclass classification datasets, including the text datasets 20 Newsgroups (2NG) and RCV1 [15] (we adopt the preprocessing method in [14] and remove the multi-labelled instances, which results in 53 different classes), and the digit recognition dataset MNIST. For 2NG and RCV1, we use the normalized tf-idf of the 10,000 most frequent words as the document features. For MNIST, the normalized grey-scale pixel values are used as features. Some statistics of these datasets are shown in Table I.

Table I
THE STATISTICS OF THE EXPERIMENTAL DATASETS

                          2NG       RCV1      MNIST
No. of training samples   11,314    15,564    60,000
No. of testing samples    7,532     518,571   10,000
No. of classes            20        53        10
Dictionary size D         10,000    10,000    784
No. of free parameters    10^8      10^8      614,656

For the experiments, we use the cosine similarity as the baseline, i.e. W = I, and mainly compare the following methods:
1) W is a diagonal matrix.
2) W is unconstrained and trained by SGD with the fixed learning rate η = 0.1.
3) W is unconstrained and trained by SGD with decaying learning rates η_t = C/√t, where C = 2.
4) W is sparse and trained by Algorithm 1 with decaying learning rates η_t = C/√t, where C = 2, followed by the corresponding refitting step.¹⁰

Note that for the decaying learning rate case, we tried a wide range of initial rates C and found that C = 2 provides the most rapid decrease of the objective value, so C is set to 2 throughout the paper. As for the regularization parameter λ, when the dictionary size is large, say D = 10,000 for the text data, we choose λ so that the density (percentage of nonzero entries) of W is between 0.5% and 1%.
When D is relatively small, say 784 for the MNIST dataset, we set λ so that the density of W is roughly 5%. This way of setting λ yields a W sparse enough to provide memory savings and easy interpretation, while leaving enough nonzero entries in W to guarantee reasonably good preference learning performance on the testing datasets.

We compare the performance of each method by the following metrics:
1) Test error rate: for a testing tuple (q, d+, d−), if q^T W d+ ≤ q^T W d−, the test error is increased by 1; the total is normalized by the test sample size.
2) Mean average precision (MAP) [16].
We also examine the semantic information encoded in the W matrix for the text data.

¹⁰ For a fair comparison, we use exactly the same training samples in the refitting step as in the sparse learning procedure, without any additional samples.
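To make the evaluation setup concrete, here is a small sketch (our own construction, with hypothetical helper names) of how tuples (q, d+, d−) can be drawn from a multiclass dataset and how the test error rate defined above is computed; X denotes a feature matrix and y the class labels.

```python
# Sketch of tuple construction from a multiclass dataset and of the test
# error rate defined above; entirely illustrative, not the authors' code.
import numpy as np

def make_tuples(X, y, n_tuples, rng):
    """Draw tuples (q, d+, d-): q and d+ share a class, d- comes from another class."""
    tuples = []
    while len(tuples) < n_tuples:
        i, j, k = rng.integers(0, len(X), size=3)
        if i != j and y[i] == y[j] and y[i] != y[k]:
            tuples.append((X[i], X[j], X[k]))
    return tuples

def test_error_rate(W, tuples):
    """Fraction of tuples where the preferred document does not score higher."""
    errors = sum(1 for q, d_pos, d_neg in tuples if q @ W @ d_pos <= q @ W @ d_neg)
    return errors / len(tuples)
```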

B. Results

1) Learning Curves: It has been shown that the test error rate is monotonic in the area under the ROC curve (AUC). In Figure 1, we show the curves of the test error rate and the corresponding density of W versus the number of training iterations, i.e. the number of accesses of the training data. Each row in Figure 1 corresponds to one dataset.

Figure 1. The test error rate and the density of W for the three benchmark datasets: (a) test error curve for 2NG; (b) test error curve for RCV1; (c) test error curve for MNIST; (d) density of W for 2NG; (e) density of W for RCV1; (f) density of W for MNIST.

We have the following observations:
1) For SGD with the fixed learning rate, the test error rate decreases very slowly and the curve is relatively flat compared to the other methods.
2) The schemes with decaying learning rates (the two sparse models and the dense model trained with the decaying rate) have similarly good convergence rates.
3) Using the same regularization parameter, sparse SGD achieves the sparsest model. Among the dense models, SGD with the decaying learning rate reaches a sparser model than the naive SGD with the fixed learning rate. This is further evidence that decaying learning rates are superior to fixed ones.
4) Refitting the sparse model clearly achieves the best test error rate while keeping the density of the W matrix low.

2) Retrieval Performance and Memory: In this section, we present the mean average precision (MAP), the test error rate and the memory required to store W for each method after training. The results are presented in Table II. We use an abbreviation for each method due to the space limitation of the table: "Identity" stands for W = I, and "Diagonal" means that W is a diagonal matrix where only the diagonal entries are learned. "SGD FLR" means SGD with the fixed learning rate, while "SGD DLR" means SGD with the decaying learning rate; both are trained on the basic model without l1 regularization. "Sparse" and "Sparse-R" are the sparse models without and with refitting, respectively.

The results on MAP are essentially consistent with those on the test error rate. The sparse method with the decaying learning rate performs similarly to its dense counterpart but takes much less memory. Moreover, the sparse model after refitting achieves the best performance.

3) Anecdotal Evidence: The sparse model can also provide useful semantic information. For example, we can easily read off the document words most related to a given query word from the sparse W matrix. More precisely, each row of W represents the strength of the correlation (either positive or negative) between the document words and a specific query word. Given the i-th query word, we sort the absolute values of the i-th row of W in descending order and pick the first few nonzero entries; those selected entries of W represent the document words most correlated with the given query word. In Table III, we present query words from different categories and the five most correlated document words according to the sparse W learned from Eq. (5). We can see that the most correlated document words clearly come from the same topics as the query words.

Table II
RETRIEVAL PERFORMANCE AND MEMORY. ITEMS IN BOLD FONTS ARE THE BEST AMONG METHODS TESTED. For each dataset, (a) 2NG, (b) RCV1 and (c) MNIST, the table reports MAP, test error and memory (MB) for the methods Identity, Diagonal, SGD FLR, SGD DLR, Sparse and Sparse-R.

Table III
THE EXAMPLES OF LEARNED RELATED WORD PAIRS IN 2NG

Query word     Five most related document words
clinton        clinton government health people gay
cpu            mac drive scsi card jon
graphics       graphics tiff image color polygon
handgun        gun weapons handgun militia fbi
hockey         hockey game espn colorado team
motorcycle     bike brake turbo rpi cylinder
religions      god religions bible christian jesus

The learned pairs also provide some interesting semantic information. For example, in 2NG, "fbi" is closely related to "handgun", and "colorado" to "hockey", which indicates that there might be a popular hockey team in Colorado (in fact, the Colorado Avalanche); "government" is related to "clinton", which indicates that Clinton might be a famous politician (the former US president Bill Clinton).

V. CONCLUSIONS

In contrast to traditional preference learning or learning to rank with a few hundred hand-crafted features, our basic model directly performs learning on actual words and considers their pairwise relationships between query and document. Although the pairwise relationships of words improve performance and provide additional semantic information, the basic model suffers from heavy storage requirements and parameter overfitting. To overcome these drawbacks, we introduce sparsity to the model, achieved by l1 regularization with an efficient online learning algorithm. We show on multiple benchmark datasets that our method achieves good performance with fast convergence, while remaining sparse during training (small memory consumption), on a model with hundreds of millions of parameters.

For future work, we will explore extended models that learn group structure or even hierarchical structure of the words using group lasso [17] type regularization. Prior knowledge on the structure of W can also be imposed.

REFERENCES

[1] J. Fürnkranz and E. Hüllermeier, "Pairwise preference learning and ranking," in ECML, 2003.
[2] T.-Y. Liu, Learning to Rank for Information Retrieval. Now Publishers Inc, 2009.
[3] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, "LETOR: Benchmark dataset for research on learning to rank for information retrieval," in SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
[4] D. Grangier and S. Bengio, "A discriminative kernel-based approach to rank images from text queries," IEEE Trans. PAMI, vol. 30, no. 8, 2008.
[5] B. Bai, J. Weston, R. Collobert, and D. Grangier, "Supervised semantic indexing," in ECIR, 2009.
[6] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, 1996.
[7] M. J. Wainwright, "Sharp thresholds for noisy and high-dimensional recovery of sparsity using l1-constrained quadratic programming (lasso)," IEEE Transactions on Information Theory, vol. 55, 2009.
[8] L. Bottou and Y. LeCun, "Large-scale on-line learning," in Advances in Neural Information Processing Systems 15. MIT Press, 2004.
[9] J. Duchi and Y. Singer, "Efficient learning using forward-backward splitting," in Advances in Neural Information Processing Systems 23, 2009.
[10] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. Addison-Wesley, Harlow, England, 1999.
[11] R. Herbrich, T. Graepel, and K. Obermayer, Large Margin Rank Boundaries for Ordinal Regression. MIT Press, Cambridge, MA, 2000.
[12] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in ICML, 2003.
[13] F. Aiolli and A. Sperduti, "Learning preferences for multiclass problems," in Advances in Neural Information Processing Systems 18, 2004.
[14] R. Bekkerman and M. Scholz, "Data weaving: Scaling up the state-of-the-art in data clustering," in CIKM, 2008.
[15] D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization," Journal of Machine Learning Research, vol. 5, 2004.
[16] G. Cormack and T. Lynam, "Statistical precision of information retrieval evaluation," in SIGIR, 2006.
[17] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society, Series B, vol. 68(1), 2006.
