Stochastic Top-k ListNet

Similar documents
Listwise Approach to Learning to Rank Theory and Algorithm

Position-Aware ListMLE: A Sequential Learning Process for Ranking

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

ECE521 Lectures 9 Fully Connected Neural Networks

Decoupled Collaborative Ranking

Gradient Descent Optimization of Smoothed Information Retrieval Metrics

Machine Learning Basics

ConeRANK: Ranking as Learning Generalized Inequalities

Information Retrieval

Information Retrieval

Greedy function optimization in learning to rank

Machine Learning for Search Ranking and Ad Auctions. Tie-Yan Liu Senior Researcher / Research Manager Microsoft Research

11. Learning To Rank. Most slides were adapted from Stanford CS 276 course.

Ad Placement Strategies

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

Large-scale Linear RankSVM

Overview. More counting Permutations Permutations of repeated distinct objects Combinations

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Logistic Regression. COMP 527 Danushka Bollegala

Playing with Numbers. introduction. numbers in general FOrM. remarks (i) The number ab does not mean a b. (ii) The number abc does not mean a b c.

Logistic Regression & Neural Networks

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Warm up: risk prediction with logistic regression

Lecture 2: Probability, conditional probability, and independence

Neural Networks: Backpropagation

Large-Margin Thresholded Ensembles for Ordinal Regression

Best reply dynamics for scoring rules

ACML 2009 Tutorial Nov. 2, 2009 Nanjing. Learning to Rank. Hang Li Microsoft Research Asia

Permutation. Permutation. Permutation. Permutation. Permutation

Maxout Networks. Hien Quoc Dang

6.036 midterm review. Wednesday, March 18, 15

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Learning to Disentangle Factors of Variation with Manifold Learning

SQL-Rank: A Listwise Approach to Collaborative Ranking

Learning to Rank by Optimizing NDCG Measure

arxiv: v2 [cs.ir] 22 Feb 2018

CS246 Final Exam. March 16, :30AM - 11:30AM

Lecture 3: Probability

CSC321 Lecture 4: Learning a Classifier

Classification of Ordinal Data Using Neural Networks

Neural networks and support vector machines

Large margin optimization of ranking measures

arxiv: v3 [cs.lg] 6 Sep 2017

University of Technology, Building and Construction Engineering Department (Undergraduate study) PROBABILITY THEORY

A Statistical View of Ranking: Midway between Classification and Regression

Final Examination CS 540-2: Introduction to Artificial Intelligence

Math for Liberal Studies

CSC321 Lecture 4: Learning a Classifier

STA141C: Big Data & High Performance Statistical Computing

ECS289: Scalable Machine Learning

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Prediction of Citations for Academic Papers

Exploiting Geographic Dependencies for Real Estate Appraisal

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time

CSE446: non-parametric methods Spring 2017

Collaborative topic models: motivations cont

Nontransitive Dice and Arrow s Theorem

On NDCG Consistency of Listwise Ranking Methods

Filter Methods. Part I : Basic Principles and Methods

Discrete Latent Variable Models

Statistical learning theory for ranking problems

CS60021: Scalable Data Mining. Large Scale Machine Learning

Natural Language Processing (CSEP 517): Text Classification

(i) The optimisation problem solved is 1 min

Kernel Learning via Random Fourier Representations

ABC-LogitBoost for Multi-Class Classification

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)

Neural Network Training

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Optimization for Machine Learning

Machine Learning for NLP

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

Convolutional Neural Networks

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

CPSC 340: Machine Learning and Data Mining

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Neural Networks: Backpropagation

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Perceptron (Theory) + Linear Regression

CS-E4830 Kernel Methods in Machine Learning

Memory-Augmented Attention Model for Scene Text Recognition

Predictive analysis on Multivariate, Time Series datasets using Shapelets

CSCI-567: Machine Learning (Spring 2019)

Voting. José M Vidal. September 29, Abstract. The problems with voting.

Jakub Hajic Artificial Intelligence Seminar I

STA141C: Big Data & High Performance Statistical Computing

Least Mean Squares Regression

Convergence rate of SGD

Statistical Ranking Problem

Generative MaxEnt Learning for Multiclass Classification

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

Instance-based Domain Adaptation

Transcription:

Stochastic Top-k ListNet Tianyi Luo, Dong Wang, Rong Liu, Yiqiao Pan CSLT / RIIT Tsinghua University lty@cslt.riit.tsinghua.edu.cn EMNLP, Sep 9-19, 2015 1

Machine Learning Ranking Learning to Rank Information Retrieval (From Tieyan Liu s tutorial) 2

Score function(d i : i -th document; q : query): f d i, q = ω 1 feature1 d i + ω 2 feature2 d i Documents: d 1 (relevant);d 2 (irrelevant). And we set ω 1 = 1, ω 2 = 1. f d 1 =1*3+1*4=7; f d 2 =1*5+1*3=8. (relevant)f d 1 = 7 < 8 = f d 2 (irrelevant) No good rank! 3

Score function(d i : i -th document; q : query): f d i, q = ω 1 feature1 d i + ω 2 feature2 d i Documents: d 1 (relevant);d 2 (irrelevant). Through learing to rank, we get new weights ω 1 = 0.1, ω 2 = 0.9. f d 1 =0.1*3+0.9*4=3.9; f d 2 =0.1*5+0.9*3=3.2. (relevant)f d 1 = 3.9 > 3.2 = f d 2 (irrelevant) Good rank! 4

Document retrieval Question answering Text summarization Machine translation Online advertising Collaborative filtering 5

Categorization of the Learning to Rank algorithms Pointwise Approach (e.g. McRank, Pranking) Pairwise Approach (e.g. Ranking SVM, RankBoost, Frank) map Listwise Approach (e.g. ListNet, ListMLE, AdaRank, SVM ) 6

Categorization of the Learning to Rank algorithms Pointwise Approach (e.g. McRank, Pranking) Pairwise Approach (e.g. Ranking SVM, RankBoost, Frank) Listwise Approach (e.g. ListNet, ListMLE, AdaRank, SVM ) Listwise learning delivers better performance in general than pointwise and pairwise learning. (Liu, 2009) 6 map

Winning numbers of 7 different learning to rank algorithms S i M = 7 7 I {Mi j >M k j } j=1 k=1 Algorithm N@1 N@3 N@10 P@1 P@3 P@10 MAP Regression 4 4 4 5 5 5 4 RankSVM 21 22 22 21 22 22 24 RankBoost 18 22 22 17 22 23 19 FRank 18 19 18 18 17 23 15 ListNet 29 31 33 30 32 35 33 AdaRank 26 25 26 23 22 16 27 map SVM 23 24 22 25 20 17 25 7

Winning numbers of 7 different learning to rank algorithms S i M = 7 7 I {Mi j >M k j } j=1 k=1 Algorithm N@1 N@3 N@10 P@1 P@3 P@10 MAP Regression 4 4 4 5 5 5 4 RankSVM 21 22 22 21 22 22 24 RankBoost 18 22 22 17 22 23 19 FRank 18 19 18 18 17 23 15 ListNet 29 31 33 30 32 35 33 AdaRank 26 25 26 23 22 16 27 map SVM 23 24 22 25 20 17 25 7

Conventional ListNet: too complex to use in practice 8

Conventional ListNet: too complex to use in practice Top-k ListNet: Top-k ListNet is a optimal algorithm of conventional ListNet However, when k is large, the complexity is still high. When k is small, a lot of ranking information is lost. 8

Conventional ListNet: too complex to use in practice Top-k ListNet: Top-k ListNet is a optimal algorithm of conventional ListNet. However, when k is large, the complexity is still high. When k is small, a lot of ranking information is lost. Goal: Reduce the computational complexity of Top-k ListNet 8

Top-1 ListNet implemented by Cao et al. (2007) 9

Top-1 ListNet implemented by Cao et al. (2007) Top-1 ListNet implemented by Dang (2013) 9

Top-1 ListNet implemented by Cao et al. (2007) Top-1 ListNet implemented by Dang (2013) Pairwise l2r sampling implemented by Sculley et al. (2009) 9

Top-1 ListNet implemented by Cao et al. (2007) Top-1 ListNet implemented by Dang (2013) Pairwise l2r sampling implemented by Sculley et al. (2009) Our work: Apply sampling methods in Top-k ListNet and publish the code 9

Challenges and Motivation Model Experiments Analysis Conclusion 10

Motivation Model Experiments Discussion Conclusion Non-measure specific listwise ranking Ranked list Permutation probability distribution Permutation and ranked list: 1-1 correspondence 11

Non-measure specific listwise ranking Ranked list Permutation probability distribution Permutation and ranked list: 1-1 correspondence f: f(a)=3,f(b)=0,f(c)=1; Probability distribution of permutations of ABC 11

p(π) Motivation Model Experiments Discussion Conclusion Non-measure specific listwise ranking Ranked list Permutation probability distribution Permutation and ranked list: 1-1 correspondence f: f(a)=3,f(b)=0,f(c)=1; Probability distribution of permutations of ABC ABC ACB BAC BCA CAB CBA π 11

Probability of a permutation is defined with Luce model: m P π f = j=1 φ f x π j k=j m φ f x π k Example: P ABC f = φ f A φ f A +φ f B +φ f C φ f B φ f B +φ f C φ f C φ f C 12

Probability of a permutation is defined with Luce model: m P π f = j=1 φ f x π j k=j m φ f x π k Example: P ABC f = φ f A φ f A +φ f B +φ f C φ f B φ f B +φ f C φ f C φ f C P(A ranked No.1) 12

Probability of a permutation is defined with Luce model: m P π f = j=1 φ f x π j k=j m φ f x π k Example: P ABC f = φ f A φ f A +φ f B +φ f C φ f B φ f B +φ f C φ f C φ f C P(A ranked No.1) P(B ranked No.2 A ranked No.1) 12

Probability of a permutation is defined with Luce model: m P π f = j=1 φ f x π j k=j m φ f x π k Example: P ABC f = φ f A φ f A +φ f B +φ f C φ f B φ f B +φ f C φ f C φ f C P(A ranked No.1) P(B ranked No.2 A ranked No.1) P(C ranked No.3 A ranked No.1, B ranked No.2) 12

p(π) Motivation Model Experiments Discussion Conclusion f: f(a)=3,f(b)=0,f(c)=1; Probability distribution of permutations of ABC ABC ACB BAC BCA CAB CBA π 13

p(π) Motivation Model Experiments Discussion Conclusion f: f(a)=3,f(b)=0,f(c)=1; Probability distribution of permutations of ABC ABC ACB BAC BCA CAB CBA π Luce model is continuous, differentiable, and concave. As shown above, P ACB is largest and P BCA is smallest. Translation invariant, e.g. φ s = exp s Scale invariant, e.g. φ s = a s 13

Loss function based on KL-divergence between two permutation probability distributions(φ = exp) L = i g G P y i g log P z i g Probability distribution Probability distribution defined by the ground truth defined by the model output Model = Neural Network Algorithm = Stochastic Gradient Descent 14

Loss function of conventional ListNet consider all the permutations: L = i g G P y i f ω g log P z i f ω g 15

Loss function of conventional ListNet consider all the permutations: L = i g G P y i f ω g log P z i f ω g L = i g G k P y i f ω g log P z i f ω g 15

Loss function of conventional ListNet consider all the permutations: L = i g G P y i f ω g log P z i f ω g L = P y i f ω g log P z i f ω g i g G k where G k j 1, j 2, j k = π Ω n π t = j t, t = 1,2,, k Conventional ListNet = Top k ListNet when k = n 15

Loss function of conventional ListNet consider all the permutations: L = i g G P y i f ω g log P z i f ω g L = i g G k P y i f ω g log P z i f ω g L = i Sample l times from G k Our P y i f ω g log P z i f ω g where G k j 1, j 2, j k = π Ω n π t = j t, t = 1,2,, k 15 work!

Challenge1: Computational complexity 16

Challenge1: Computational complexity Reduce the complexity for query m: n m! -> n m! (n m k)! However, when k is large, the complexity is still high. When k is small, a lot of ranking information is lost. n m : Number of documents of query m k: Order of Top-k When existing queries which have lots of candidates, k=2 is large. 16

Challenge1: Computational complexity Reduce the complexity for query m: n m! -> n m! (n m k)! However, when k is large, the complexity is still high. When k is small, a lot of ranking information is lost. When existing queries which have lots of candidates, k=2 is large. Challenge2: Data Imbalance n m : Number of documents of query m k: Order of Top-k 16

Challenge1: Computational complexity Reduce the complexity for query m: n m! -> n m! (n m k)! However, when k is large, the complexity is still high. When k is small, a lot of ranking information is lost. When existing queries which have lots of candidates, k=2 is large. Challenge2: Data Imbalance Many irrelevant documents exist in benchmarks e.g. LETOR. More list containing relevant documents should be considered. 16 n m : Number of documents of query m k: Order of Top-k

Challenges: Computational complexity Data Imbalance 17

Challenges: Computational complexity Data Imbalance Our solution: Sampling and Resampling! 17

p(π) To overcome Challenge 1: Computational complexity Sample a small set of the Top-k permutation lists. Impose a bound of the computation cost. ABC ACB BAC BCA CAB CBA π 18

p(π) To overcome Challenge 1: Computational complexity Sample a small set of the Top-k permutation lists. Impose a bound of the computation cost. ABC ACB BAC BCA CAB CBA π 18

Three sampling methods (e.g. k = 3): Uniform distribution sampling Fixed distribution sampling Adaptive distribution sampling 19

p(π) Motivation Model Experiments Discussion Conclusion Three sampling methods (e.g. k = 3): Uniform distribution sampling 1. Sampling l lists(every list is selected with the same probability; 2. Probability distribution doesn t change every time. ABC ACB BAC BCA CAB CBA π 20

Three sampling methods (e.g. k = 3): Uniform distribution sampling: 1. Sampling l lists(every list is selected with the same probability; 2. Probability distribution doesn t change every time. Probability Distributioon: ABC:1/6 ACB:1/6 BAC:1/6 BCA:1/6 CAB:1/6 CBA:1/6 Probability Distributioon: ABC:1/6 ACB:1/6 BAC:1/6 BCA:1/6 CAB:1/6 CBA:1/6 Probability Distributioon: ABC:1/6 ACB:1/6 BAC:1/6 BCA:1/6 CAB:1/6 CBA:1/6 One sample list: ABC One sample list: ABC BAC 21 One sample list: ABC Permutations BAC will be selected CBA randomly.

p(π) Motivation Model Experiments Discussion Conclusion Three sampling methods (e.g. k = 3): Fixed distribution sampling ABC ACB BAC BCA CAB CBA π 1. Sampling l lists(every list is selected with the different probabilities; 2. Different probabilities are based on different labeled relevant degrees; 3. Probability distribution doesn t change every time. 22

Three sampling methods (e.g. k = 3): Fixed distribution sampling: 1. Sampling l lists(every list is selected with the differentprobabilities; 2. Different probabilities are based on different labeled relevant degrees; 3. Probability distribution doesn t change every time. Probability Distributioon: ABC:1/4 ACB:1/3 BAC:1/9 BCA:1/24 CAB:1/6 CBA:1/8 Probability Distributioon: ABC:1/4 ACB:1/3 BAC:1/9 BCA:1/24 CAB:1/6 CBA:1/8 Probability Distributioon: ABC:1/4 ACB:1/3 BAC:1/9 BCA:1/24 CAB:1/6 CBA:1/8 One sample list: ACB One sample list: ACB ABC 22 One sample list: ACB Permutation ABC which has larger CAB probability will be selected easily.

p(π) Motivation Model Experiments Discussion Conclusion Three sampling methods (e.g. k = 3): Adaptive distribution sampling ABC ACB BAC BCA CAB CBA π 1. Sampling l lists(every list is selected with the different probabilities; 2. Different probabilities are based on different output scores; 3. Probability distribution adaptively changes every time. 23

Three sampling methods (e.g. k = 3): Adaptive distribution sampling: 1. Sampling l lists(every list is selected with the differentprobabilities; 2. Different probabilities are based on different output scores; 3. Probability distribution adaptively changes every time. Probability Distributioon: ABC:1/4 ACB:1/3 BAC:1/9 BCA:1/24 CAB:1/6 CBA:1/8 Probability Distributioon: ABC:10/36 ACB:9/24 BAC:1/12 BCA:1/36 CAB:13/72 CBA:1/12 Probability Distributioon: ABC:7/24 ACB:7/18 BAC:7/72 BCA:1/72 CAB:7/36 CBA:5/72 One sample list: ACB One sample list: ACB ABC 23 One sample list: ACB Permutation ABC which has larger CAB probability will be selected easily.

To overcome Challenge 2: Data Imbalance Re-sampling methods(e.g. k = 3): k i=1 ks s i 24

To overcome Challenge 2: Data Imbalance Re-sampling methods(e.g. k = 3): k i=1 ks s i s i : The value of the human-labelled score of the i document selected. S : The maximum value of the humanlabelled score, which is 2 in our case. 24

To overcome Challenge 2: Data Imbalance Re-sampling methods(e.g. k = 3): k i=1 ks s i s i : The value of the human-labelled score of the i document selected. S : The maximum value of the humanlabelled score, which is 2 in our case. The probability that a list is retained. 24

To overcome Challenge 2: Data Imbalance Re-sampling methods(e.g. k = 3): k i=1 ks s i s i : The value of the human-labelled score of the i document selected. S : The maximum value of the humanlabelled score, which is 2 in our case. The probability that a list is retained. Give one example(a:2 very relevant;b:1 relevant;c:0 irrelevant): k i=1 Given a. s 1 = 2, s 2 = 1, s 3 = 0. 24 ks s i = 2+1+0 3 2 =1 2

Re-sampling steps: Calculate the probability of ABC. It is 1/2 based on last slide. Randomly generate a value in (0,1) e.g. 1/6. If 1/6 < 1/2, list ABC is retained to conduct training. Re-sampling make lists containing more relevant documents trained 25

When k=1: ω = j σ z i, j σ y i, j x j i Where σ s, j is the j-th value of the softmax function of the score vector s, given by: σ s i, j = i es j n i t=1 e s i t 26

When k>1: ω = g Gk k t=1 σ y i k, t f=1 x jf i n i v=f σ z i, v x jf i Where σ defines a partial softmax function of the score vector s, given by: σ s i, f = i es f n i t=f e s i t 27

Dataset LETOR 4.0 (Liu et al., 2007) We use MQ2008 dataset in the LETOR 4.0 Training set, validation set and test data all contain 784 queries. Learning rate: 10 3 for k = 1; 10 5 for k > 1. 28

k = 1: Figure 1: The P@1 performance on the test data with the Top-1 ListNet utilizing the three sampling approaches. The size of the permutation subset varies from 50 to 500. 29

k = 2: Figure 2: The P@1 performance on the test data with the Top-2 ListNet utilizing the three sampling approaches. The size of the permutation subset varies from 5 to 500. 30

k = 3: Figure 3: The P@1 performance on the test data with the Top-3 ListNet utilizing the three sampling approaches. The size of the permutation subset varies from 5 to 500. 31

k = 4: Figure 4: The P@1 performance on the test data with the Top-4 ListNet utilizing the three sampling approaches. The size of the permutation subset varies from 5 to 500. 32

Larger k: Figure 5: The P@1 performance on the test data with the stochastic Top-k ListNet approach, where k varies from 1 to 100. 33

P@1 and P@10: Table 1: Averaged training time (in seconds), P@1 and P@10 on training, validation (Val.) and test data with different Top-k methods. C-ListNet stands for conventional ListNet, S-ListNet stands for stochastic ListNet. 34

P@1 and P@10: 1000+ times! Table 1: Averaged training time (in seconds), P@1 and P@10 on training, validation (Val.) and test data with different Top-k methods. C-ListNet stands for conventional ListNet, S-ListNet stands for stochastic ListNet. 34

Better measures of document scores Good measures of rank but not good measures of relevance. Better measures can be learnt and make performance better. Harsh labelling of the MQ 2008 dataset Documents are labelled by only three values {0, 1, 2}. Training data containing many documents with the same scores. Other Listwise algorithms can utilize sampling methods 35

Summary: Proposed a stochastic ListNet method. Speed up the training and improve the rank performance. Future work: Apply our method to other datasets which contains more human labels. Apply our method to other listwise learning to rank methods. Our code: https://github.com/pkuluotianyi/topkstolistnet 36

Thank you! Presented by Tianyi Luo EMNLP, Sep 9-19, 2015 37