A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank Shoaib Jameel Shoaib Jameel 1, Wai Lam 2, Steven Schockaert 1, and Lidong Bing 3 1 School of Computer Science and Informatics, Cardiff University. 2 Dept. of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. 3 Machine Learning Department, Carnegie Mellon University. October 12, 2015 ONE LINE SUMMARY OF THE WORK Incorporating the latent topic similarity feature in the regularized topic space helps improve standard learning-to-rank results. This work will be presented at the 24 th ACM International Conference on Information and Knowledge Management (CIKM 2015) on October 20, 2015 in Melbourne, Australia. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 1 / 32

TABLE OF CONTENTS 1 Background 2 Introduction 3 Related Topic Models Unsupervised Latent Dirichlet Allocation (LDA) Supervised Latent Dirichlet Allocation (slda) Maximum Margin Models Maximum Margin Supervised Topic Models (MedLDA) 4 Our Pairwise Maximum Margin Learning to Rank Model Posterior Regularization Bayesian Posterior Regularization LDA as a Regularized Bayesian Model Regularized Bayesian Learning to Rank Parameter Estimation 5 Experiments and Results Comparative Models and Evaluation Metric NDCG Results 6 Conclusions Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 2 / 32

Background: Ad-Hoc Retrieval Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 3 / 32

INTRODUCTION What is? Maximum margin learning-to-rank for document retrieval models are well known for their generalizeability. Probabilistic topic models have shown to give good performance on ad-hoc document retrieval [Wei and Croft, 2006]. Why not incorporate both frameworks in a model? A Common Approach 1 Train an existing topic model such as Latent Dirichlet Allocation (LDA). 2 Compute a topic-based similarity between the query and a document. 3 Use the similarity score as one of the features. 4 Train the learning-to-rank models. A Problem Such techniques are inherently sub-optimal due to error propagation. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 4 / 32

LEARNING-TO-RANK FRAMEWORK Three approaches to learning to rank: 1 Pointwise - Ranking as a regression problem. 2 Pairwise - Binary classification problem. 3 Listwise - List optimization problem. Query-document pairs are represented by a set of features. These features are computed prior to conducting the learning. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 5 / 32

UNSUPERVISED LATENT DIRICHLET ALLOCATION (LDA) α β φ θ d z w LDA [Blei et al., 2003] posits that documents correspond to mixture of topics. Topic is a probability distribution over words. Each word w has a latent topic assignment z. Topic weights for each document are stored in θ d φ is a matrix of word-topic distributions. α and β are prior distributions. K N d M Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 6 / 32

UNSUPERVISED LATENT DIRICHLET ALLOCATION (LDA) α θ d z 3 z 1 z 4 z 5 z 6 z 7 z 2 w 3 w 1 w 4 w 5 w 6 w 7 w 2 M β φ K Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 7 / 32

Unsupervised Latent Dirichlet Allocation (LDA) UNSUPERVISED LATENT DIRICHLET ALLOCATION MODEL θ d Dirichlet α β Multinomial Dirichlet z φ Multinomial K w N d M Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 8 / 32

Unsupervised Latent Dirichlet Allocation (LDA) LDA AS A MATRIX FACTORIZATION MODEL Topics Documents {}}{ {}}{ k 1 k 2 k 3 d 1 d 2 d D v 1 0.50 0.11 0.10 v 1 8 1 1 1 4 v 2 0.44 0.27 0.34 v 2 5 12 0 0 1 Terms 1 0 1 0 1 = Terms 0 0 0 1 1 0 0 0 0 1 v W 0 0 0 0 1 v W 0.00 0.00 0.47 Documents {}}{ d 1 d 2 d D k 1 0.19 0.05 0.10 k 2 0.01 0.43 0.52 Topics k 3 0.03 0.15 0.24 Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 9 / 32

Supervised Latent Dirichlet Allocation (slda) SUPERVISED LATENT DIRICHLET ALLOCATION (SLDA) β φ K (η, σ 2 ) y d α θ d z w N d M slda [Mcauliffe and Blei, 2008] posits that documents correspond to mixture of topics. Topic is a probability distribution over words. Each word w has a latent topic assignment z. Topic weights for each document are stored in θ d φ is a matrix of word-topic distributions. α and β are prior distributions. Incorporates the document meta-data y d to generate topics, for example, classification labels, sentiment labels, etc. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 10 / 32

Maximum Margin Models MAXIMUM MARGIN MODELS T 6 Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 11 / 32 T 5

Maximum Margin Models LINEAR MAXIMUM MARGIN MODELS Linear Maximum Margin Classifier minimize η 1 2 η 2 + C ξ ik subject to ξ ik 0 y u ik η (ψ u i ) 1 ξ ik, u, i, k, Pairwise Maximum Margin Classifier minimize η 1 2 η 2 + C ξ ik subject to ξ ik 0 y u ik η (ψ u i ψ u k ) 1 ξ ik, u, i, k, Notations ψ u i - Feature vector. C - Regularization parameter. η - Weight vector. yik u - Class label. ξ ik - Non-negative slack variables. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 12 / 32

Maximum Margin Supervised Topic Models (MedLDA) MAXIMUM MARGIN SUPERVISED TOPIC MODELS (MEDLDA) Topic model for document classification [Zhu et al., 2012]. Regularizes the topic space with maximum margin constraints. Topic model parameter learning and maximum margin parameter learning are tightly integrated into a single model. Uses a latent linear discriminant function with a random weight vector. Encapsulates the topic feature weights. Learning is conducted iteratively: Maximum margin solver. Parameters of the topic model. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 13 / 32

Maximum Margin Supervised Topic Models (MedLDA) AD-HOC RETRIEVAL USING LDA ([WEI AND CROFT, 2006]) Study the effectiveness of the LDA model in ad-hoc retrieval task. They proposed LDA-based document model within the language modeling framework. Basic Language Modeling for Ad-hoc IR P(w d) = N ( d Maximum Likelihood N d + µ P ML(w d) + 1 N ) d P ML (w collection) N d + µ Maximum Likelihood LDA-based Language Modeling for Ad-hoc IR ( ( Nd P(w d) = λ N d + µ P ML(w d)+ 1 N d N d + µ ) ) P ML (w collection) +(1 λ)(p LDA (w d)) K P LDA (w d, θ, φ)) = P(w z, φ)p(z θ, d) z=1 Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 14 / 32

OUR MODEL - PAIRWISE REGULARIZED BAYESIAN LEARNING-TO-RANK MODEL Our Model = Generative Learning + Discriminative Learning We propose an optimization algorithm to conduct learning-to-rank using topic models. We integrate the mechanism behind a basic topic model, LDA, with a maximum-margin pairwise learning-to-rank classifier. As a result, we obtain a unified learning-to-rank model, which conducts regularized Bayesian inference. In our model, the posterior distribution of the LDA model is regularized with a pairwise maximum margin classifier. This allows us to directly control the posterior distribution. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 15 / 32

Posterior Regularization POSTERIOR REGULARIZATION - HIGH-LEVEL IDEA Likelihood Prior Bayes theorem P(M x) = P(x M)P(M) P(x M)P(M)dM Posterior Evidence of M given x Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 16 / 32

Bayesian Posterior Regularization BAYESIAN INFERENCE AS AN OPTIMIZATION PROBLEM Let M M be the model space. The objective is to find the posterior distribution which we intend to infer. The posterior distribution of the Bayes Theorem is the solution to the following optimization problem: Bayes Optimization Model minimize P(M) M Kullback- Leibler Divergence KL subject to P(M) P [P(M x 1, x 2,, x N ) P(M)] N n=1 log P(x n M)P(M)dM Constraints on the posterior distribution. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 17 / 32

Bayesian Posterior Regularization BAYESIAN REGULARIZATION Bayes Optimization Model minimize KL [P(M x 1, x 2,, x N ) P(M)] P(M) M subject to P(M) P N n=1 log P(x n M)P(M)dM Bayesian Regularization minimize KL [P(M x 1, x 2,, x N ) P(M)] P(M) M subject to P(M) P(χ) N n=1 log P(x n M)P(M)dM+ Convex Function f (χ) Can encode rich information or knowledge. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 18 / 32

LDA as a Regularized Bayesian Model LDA AS A REGULARIZED BAYESIAN MODEL Posterior of LDA LDA P(Θ, Z, Φ W, α, β) = P 0 (Θ, Z, Φ α, β)p(w Θ, Z, Φ), P(W α, β) where, P 0 (Θ, Z, Φ α, β) - Prior probability. w d = {w d n }Nd n=1 W = {w d } D d=1 z d = {z d n }Nd n=1 Z = {z d } D d=1 Θ = {θ d } D d=1 Φ = {φ 1, φ 2,, φ K } = V K - Matrix of topic distribution parameters α - Parameter of the Dirichlet prior on the per-document topic distributions β - Parameter of the Dirichlet prior on the per-topic word distributions Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 19 / 32 β φ K θ d z w N d D α

Regularized Bayesian Learning to Rank REGULARIZED BAYESIAN LEARNING TO RANK Design Details Combine ideas from pairwise maximum margin learning to rank with topic models. Convex function can be borrowed from the maximum margin pairwise learning to rank model. We choose the basic topic model - LDA Formed an aggregated document set consisting of queries and documents. Incorporate the latent topic information in the form of a latent topic similarity between the document and the query during learning. minimize KL[P(Θ, Z, Φ W, α, β) P 0(Θ, Z, Φ α, β)] E P [log P(W Θ, Z, Φ)] P(Θ,Z,Φ α,β) P,ξ }{{} LDA Model + M(ξ) }{{} Convex Function subject to P(Θ, Z, Φ α, β) P new(ξ), ξ 0 Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 20 / 32

Regularized Bayesian Learning to Rank FULL REGULARIZED BAYESIAN LEARNING TO RANK MODEL minimize KL[P(Θ, Z, Φ W, α, β) P 0 (Θ, Z, Φ α, β)] E P [log P(W Θ, Z, Φ)] P(Θ,Z,Φ α,β), η,η t,ξ + 1 2 ( η 2 + ηt 2 ) + C x 1 ξ u (i,k) x B u=1 u (i,k) B u }{{} Convex Function - Pairwise Maximum Margin subject to ξ u (i,k) hu i hk u y ik u [ η (ψi u ψk u ) + η t (Υ u i Υ u k ) ], u, i, k, }{{} Dynamic Topic Similarity P(Θ, Z, Φ α, β) P(ξ u (i,k) ) ξ u (i,k) 0, u, i, k. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 21 / 32

Regularized Bayesian Learning to Rank MODELING ILLUSTRATION T 6 T 5 Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 22 / 32

Regularized Bayesian Learning to Rank SOLVING THE OPTIMIZATION MODEL Iterative Procedure 1 Monte Carlo method - Collapsed Gibbs sampling. 2 Iteratively compute: 1 Maximum margin separation of the points, i.e. we determine η and η t given P(Θ, Z, Φ W, α, β). 2 P(Θ, Z, Φ W, α, β) given (η, η t ). 3 Both steps are repeated for a given number of iterations or until the sampler converges to a steady state. P(Z α) Ω is the normalization constant. P(W, Z α, β) Ω } {{ } Topic model e 1 2 }{{ Ψ }. Pairwise classifier Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 23 / 32

Parameter Estimation SAMPLER FORMULATIONS ( K β Γ( s=1 P(Z α) βs) V ) β φ n zw+βw 1 z=1 s=1 Γ(βs) zw dφ z w=1 }{{} Expanding using Dirichlet - Word-Topic statistics ( D α Γ( s=1 αs) K ) α θ p dz +α z 1 dθ d e 2 1 d=1 s=1 Γ(αs) }{{ Ψ } z=1 }{{} Regularizer Expanding using Dirichlet - Document-Topic statistics Pairwise Maximum Margin Regularizer [ x ] x = λ u dk λûˆd [(ψ ˆk d u ψu k ) + (Υu d Υu k )] ) [(ψûˆd ψûˆk u=1 (d,k) B u û=1 (ˆd ˆk) Bû + ) ] (Υûˆd Υûˆk x Ψ = λ u dk (hu d hu k ) u=1 (d,k) B u Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 24 / 32

Parameter Estimation TRANSITION EQUATIONS ( ) P(zn d = t Z n, w = v, W n, α, β) n tv + β w d 1 [ n V v=1 n ] tv + β v 1 Posterior distributions are computed as: Building Word-Topic Matrix } {{ } Word-topic statistics (p dt + α t 1) }{{} Document-topic statistics e 1 2 }{{ Ψ } Regularizer φ zv = n zv + β v ( ) V v=1 nzv + βv Building Document-Topic Matrix θ dz = p dz + α z 1 ( ) e 2 Ψ K z=1 p dz + α z Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 25 / 32

EXPERIMENTS AND RESULTS Datasets We need original documents to build a topic model. Many standard learning to rank test collections do not come with original documents, except OHSUMED. Used Terrier Information Retrieval platform to compute query-document features. Experimented on standard TREC collections: 1 OHSUMED - 45 normalized features 2 AQUAINT - 39 normalized features 3 WT2G - 39 normalized features 4 ClueWeb-2009-91 normalized features Train-Validate-Test Splits Training set - approximately 60% of query-document pairs Testing set - approximately 20% of query-document pairs Validation set - approximately 20% of query-document pairs Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 26 / 32

Comparative Models and Evaluation Metric COMPARATIVE MODELS AND EVALUATION METRIC Comparative Models Listwise - Listnet, AdaRank-MAP, AdaRank-NDCG, Coordinate Ascent, LambdaMART, SVMMAP, and MART Pairwise - RankNet, RankBoost, RankSVM-Struct, and LambdaRank Pointwise - Linear Regression - L2 Norm, and Random Forests DirectRank - recently proposed [Tan et al., 2013]. Evaluation Metric Normalized Cumulative Discounted Gain (NDCG): W (q s) = 1 T n n i=1 2 r(i) 1 log(1 + i) T n - Normalization constant. r(i) - Rank label. n - Length of ranked list. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 27 / 32

Background Introduction Related Topic Models Our Pairwise Maximum Margin Learning to Rank Model Experiments and Results Conclusions NDCG Results R ESULTS - NDCG@10 Models ListNet AdaRankNDCG AdaRankMAP SVMMAP Coordinate Ascent LambdaMART LambdaRank Linear Regression RankSVM RankBoost DirectRank MART RankNet Random Forests Semantic Search Our Model Shoaib Jameel et al., Cardiff University AQUAINT W ITHOUT TOPICS WT2G ClueWeb09 OHSUMED 0.217 0.292 0.188 0.489 0.331 0.211 0.427 0.431 0.253 0.486 0.212 0.426 0.274 0.276 0.469 0.512 0.333 0.275 0.408 0.448 0.299 0.061 0.213 0.463 0.266 0.511 0.211 0.224 0.214 0.408 0.238 0.430 0.288 0.262 0.295 0.256 0.262 0.281 0.465 0.478 0.496 0.479 0.438 0.471 0.345 0.292 0.381 0.235 0.363 0.266 0.309 0.421 0.432 0.425 0.439 0.433 0.255 0.481 0.367 0.423 0.301 0.531 0.394 0.454 Learning-to-rank with Topic Models October 12, 2015 28 / 32

Background Introduction Related Topic Models Our Pairwise Maximum Margin Learning to Rank Model Experiments and Results Conclusions NDCG Results R ESULTS - NDCG@10 Models ListNet AdaRankNDCG AdaRankMAP SVMMAP Coordinate Ascent LambdaMART LambdaRank Linear Regression RankSVM RankBoost DirectRank MART RankNet Random Forests Semantic Search Our Model Shoaib Jameel et al., Cardiff University AQUAINT W ITH TOPICS WT2G ClueWeb09 OHSUMED 0.217 0.296 0.194 0.496 0.344 0.224 0.431 0.434 0.253 0.489 0.232 0.431 0.278 0.276 0.472 0.515 0.348 0.275 0.415 0.449 0.299 0.060 0.213 0.465 0.268 0.514 0.245 0.244 0.244 0.415 0.245 0.431 0.291 0.262 0.295 0.295 0.262 0.281 0.469 0.478 0.503 0.481 0.441 0.475 0.372 0.301 0.386 0.266 0.369 0.311 0.318 0.425 0.433 0.426 0.441 0.439 0.255 0.489 0.367 0.423 0.301 0.531 0.394 0.454 Learning-to-rank with Topic Models October 12, 2015 29 / 32

NDCG Results QUERY-LEVEL PERFORMANCE 1 2 3 >3 AQUAINT 54.33 59.74 69.12 78.66 WT2G 59.64 67.35 73.92 73.92 ClueWeb 61.23 72.98 79.76 83.21 LETOR OHSUMED 57.34 63.65 69.44 75.61 Table: Query level performance in terms of the percentage of winning numbers for different query lengths. Winning Number is defined as follows n m W i = I(NDCG i (j) > NDCG k (j)) j=1 k=1 Notations n - Number of datasets. k - Indexes of compared models. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 30 / 32

CONCLUSIONS Novel LTR model that combines latent topic information with maximum margin learning in a unified way. Latent topic representation is directly regularized with a pairwise maximum margin constraint. Topic similarity is computed in the regularized latent topic space. As our experiments show, such an integrated approach outperforms the existing two-stage de-coupled approach to incorporating latent topics in LTR models. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 31 / 32

REFERENCES [Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR, 3:993 1022. [Mcauliffe and Blei, 2008] Mcauliffe, J. D. and Blei, D. M. (2008). Supervised topic models. In NIPS, pages 121 128. [Tan et al., 2013] Tan, M., Xia, T., Guo, L., and Wang, S. (2013). Direct optimization of ranking measures for learning to rank models. In KDD, pages 856 864. [Wei and Croft, 2006] Wei, X. and Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In SIGIR, pages 178 185. [Zhu et al., 2012] Zhu, J., Ahmed, A., and Xing, E. P. (2012). MedLDA: Maximum margin supervised topic models. JMLR, 13(1):2237 2278. Shoaib Jameel et al., Cardiff University Learning-to-rank with Topic Models October 12, 2015 32 / 32