QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees

Size: px

Start display at page:

Download "QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees"

Meghan Richards
5 years ago
Views:

1 QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto HPC Lab, ISTI-CNR, Pisa, Italy & Tiscali SpA Salvatore Orlando Università Ca Foscari, Venice, Italy Rossano Venturini Università di Pisa, Pisa, Italy

2 Ranking (in web search) is computationally expensive and requires trade-offs between efficiency and efficacy to be devised

3 Additive ensembles of regression trees Query First step Second step Results Page(s) Base Ranker N docs Top Ranker K docs Document Index Features Learning to Rank Algorithm K. QuickScore: from X to 6.5X faster scoring

4 Additive ensembles of regression trees

5 The winner proposal used a linear combination of 12 ranking models, 8 of which were LambdaMART boosted tree models, having each up to 3,000 trees About 24,000 regression trees in total! [C. Burges, K. Svore, O. Dekel, Q. Wu, P. Bennett, A. Pastusiak and J. Platt, Microsoft Research] Additive ensembles of regression trees Yahoo! Learning to Rank Challenge

6 Query-Document feature set Process of Query-Document Scoring

7 Query-Document feature set Process of Query-Document Scoring

8 Query-Document feature set Process of Query-Document Scoring

9 Query-Document feature set Process of Query-Document Scoring

10 Query-Document feature set Process of Query-Document Scoring

11 Query-Document feature set Process of Query-Document Scoring

12 Query-Document feature set Process of Query-Document Scoring

13 Query-Document feature set Process of Query-Document Scoring

14 Query-Document feature set Exit leaf Process of Query-Document Scoring

15 Query-Document feature set Exit leaf Score += Process of Query-Document Scoring

16 number number number number of of of of trees = 1K 20K leaves = 4 64 docs = 3K-10K features = Additive ensembles of regression trees -

17 Exit leaf SoA: Struct+ Query-Document feature sets Naïve baseline Each tree node is represented by a C++ object containing the feature id, the associated threshold and the left and right pointers.

18 SoA: If-then-else Query-Document feature sets if (x[4] <= 50.1) { // recurses on the left subtree } else { // recurses on the right subtree if(x[3] <= -3.0) result = 0.4; else result = -1.4; }

19 Need to store the structure of the tree High branch misprediction rate Low cache hit ratio SoA: If-then-else Query-Document feature sets if (x[4] <= 50.1) { // recurses on the left subtree } else { // recurses on the right subtree if(x[3] <= -3.0) result = 0.4; else result = -1.4; }

20 16 docs F 1 F 2 F 3 F 4 F 5 F 6 F 7 int nodeid F 8 = 0; nodeid = nodes->children[x[nodes[nodeid].fid] > nodes[nodeid].theta]; nodeid = nodes->children[x[nodes[nodeid].fid] > nodes[nodeid].theta]; nodeid = nodes->children[x[nodes[nodeid].fid] > nodes[nodeid].theta]; nodeid = nodes->children[x[nodes[nodeid].fid] > nodes[nodeid].theta]; return scores[nodeid]; } :F 10.1:F 3 1 Query-Document feature sets double depth4(float* x, Node* nodes) { SoA: VPred [Asadi et Al. TKDE 2014]

21 QuickScore, a new eﬃcient algorithm for the interleaved traversal of additive ensembles of regression trees by means of simple logical bitwise operations

22 QuickScore: false and true nodes

23 Given the feature set, each node of a tree can be classified as True or False True Node False Node QuickScore: false and true nodes

24 BITVECTOR? ?? QuickScore: Single Tree Traversal

25 BITVECTOR? ?? QuickScore: Single Tree Traversal

26 BITVECTOR? ?? QuickScore: Single Tree Traversal

27 BITVECTOR? ?? QuickScore: Single Tree Traversal

28 BITVECTOR? ?? QuickScore: Single Tree Traversal

29 BITVECTOR? ?? QuickScore: Single Tree Traversal

30 BITVECTOR? ?? QuickScore: Single Tree Traversal

31 BITVECTOR? ?? QuickScore: Single Tree Traversal

32 ? ?? QuickScore: use of false nodes masks

33 BITVECTOR AND AND = ??? QuickScore: use of false nodes masks

34 BITVECTOR AND AND = ? ?? Few operations, insensitive to nodes processing order! QuickScore: use of false nodes masks

35 F offsets Read-only, sequential data access R/W, random access Read-only, sequential access bitvectors bitmasks v leaves f 0 increasing values f 1 f F 1 num. leaves num. leaves num. leaves num. leaves num.leaves num. trees num. leaves QuickScore: data structures

36 offsets bitvectors bitvectors v leaves f 0 increasing values Query-Document Features sets F 0 F 1 F 2 F 3 F 4 F 5 F 6 F F F +1 f 1 f F 1 f F 1 num. leaves num. leaves num. leaves num. leaves num.leaves num. trees num. leaves num. leaves QuickScore: interleaved tree traversals

37 offsets bitvectors bitvectors v leaves f 0 increasing values Query-Document Features sets F 0 F 1 F 2 F 3 F 4 F 5 F 6 F F F +1 f 1 f F 1 f F 1 num. leaves num. leaves num. leaves num. leaves num.leaves num. trees num. leaves num. leaves QuickScore: interleaved tree traversals

38 offsets bitvectors bitvectors offsets v leaves f 0 increasing values Query-Document Features sets F 0 F 1 F 2 F 3 F 4 F 5 F 6 F F F +1 f 1 f F 1 f F 1 num. leaves num. leaves num. leaves num. leaves num.leaves num. trees num. leaves num. leaves QuickScore: interleaved tree traversals

39 offsets bitvectors bitvectors offsets v leaves f 0 increasing values Query-Document Features sets F 0 F 1 F 2 F 3 F 4 F 5 F 6 F F F +1 f 1 f F 1 f F 1 num. leaves num. leaves num. leaves num. leaves num.leaves num. trees num. leaves num. leaves Low branch misprediction rate High cache hit ratio QuickScore: interleaved tree traversals

40 Experimental Settings Lambda-MART ranking models optimizing learned with RankLib from MSN and Yahoo LETOR datasets Ensembles with 1K, 5K, 10K, or 20K regression trees, each with up to 8, 16, 32, or 64 leaves Intel Core 3.50Ghz CPU, with 32GB RAM, Ubuntu Linux

Experimental Results Per-document scoring time in microsecs and speedups Table 2: Per-document scoring time in µs of QS, VPred, If-Then-Else and Struct+ on MSN-1 and Y!S1 datasets.

41 Experimental Results Per-document scoring time in microsecs and speedups Table 2: Per-document scoring time in µs of QS, VPred, If-Then-Else and Struct+ on MSN-1 and Y!S1 datasets. Gain factors are reported in parentheses. Method 1, 000 MSN-1 5, 000 Y!S1 MSN-1 Number of trees/dataset 10, 000 Y!S1 MSN-1 QS 2.2 ( ) 4.3 ( ) 10.5 ( ) 14.3 ( ) VPred 7.9 (3.6x) 8.5 (x) 40.2 (3.8x) 41.6 (2.9x) 8 If-Then-Else 8.2 (3.7x) 10.3 (2.4x) 81.0 (7.7x) 85.8 (6.0x) Struct (9.6x) 23.1 (5.4x) (10.3x) (7.9x) QS 2.9 ( ) 6.1 ( ) 16.2 ( ) 22.2 ( ) VPred 16.0 (5.5x) 16.5 (2.7x) 82.4 (5.0x) 82.8 (3.7x) 16 If-Then-Else 18.0 (6.2x) 21.8 (3.6x) (7.8x) (5.8x) Struct (14.7x) 41.0 (6.7x) (26.2x) (18.2x) QS 5.2 ( ) 9.7 ( ) 2 ( ) 34.3 ( ) VPred 31.9 (6.1x) 31.6 (3.2x) (6.0x) (4.7x) 32 If-Then-Else 34.5 (6.6x) 36.2 (3.7x) (11.1x) (8.0x) Struct (13.3x) 67.4 (6.9x) (34.2x) (24.3x) QS 9.5 ( ) 15.1 ( ) 56.3 ( ) 66.9 ( ) VPred 62.2 (6.5x) 57.6 (3.8x) (6.3x) (5.0x) 64 If-Then-Else 55.9 (5.9x) 55.1 (3.6x) (16.6x) (14.0x) Struct (11.6x) (7.7x) (29.5x) (23.2x) same trivially holds for Struct+. This means that the interleaved traversal strategy of QS needs to process less nodes than in a traditional root-to-leaf visit. This mostly explains the results achieved by QS. As far as number of branches is concerned, we note that, not surprisingly, QS and VPred are much more efficient than If-Then-Else and Struct+ with this respect. QS has a larger total number of branches than VPred, which uses scoring functions that are branch-free. However, those 20.0 ( ) 80.5 (4.0x) (9.3x) (18.7x) 32.4 ( ) (5.1x) (19.0x) (37.6x) 59.6 ( ) (5.7x) (23.4x) (30.3x) ( ) (4.7x) (15.9x) (19.3x) 20, 000 Y!S1 MSN-1 Y!S ( ) 82.7 (3.3) (7.3x) (15.4x) 41.2 ( ) (4.0x) (9.9x) (28.9x) 70.3 ( ) (4.8x) (19.8x) (25.2x) ( ) (4.4x) (15.2x) (18.4x) 40.5 ( ) (4.0x) (17.5x) (28.4x) 67.8 ( ) (4.9x) (26.0x) (38.2x) ( ) (4.5x) (20.4x) (29.6x) ( ) (3.0x) 466 (11.0x) (12.8x) 48.1 ( ) (3.4x) (16.0x) (23.7x) 81.0 ( ) (4.1x) (21.1x) (32.4x) ( ) (4.3x) (19.4x) (27.0x) ( ) (4.1x) (14.0x) (15.9x) number of L3 cache misses starts increasing when dealing with 9, 000 trees on MSN and 8, 000 trees on Y!S1. BWQS: a block-wise variant of QS The previous experiments suggest that improving the cache efficiency of QS may result in significant benefits. As in Tang et al. [12], we can split the tree ensemble in disjoint blocks of size that are processed separately in order to let the corresponding data structures fit into the faster levels of

42 on MSN-1 with 64-leaves -MART models. Method Number of Trees 1, 000 5, , , , 000 Instruction Count QS VPred If-Then-Else Struct Num. Visited Nodes (above) Visited Nodes/Total Nodes (below) QS % 21% 25% 26% 29% VPred % 89% 89% 88% 77% Struct If-Then-Else 64% 62% 59% 57% 50% MSN-1 with 64-leaves λ-mart models. Per-tree per-document low-level statistics on

43 MSN-1: Scoring Time and Cache Misses Scoring time per document per tree (µs) MSN-1 QS Scoring Time QS Cache Misses Number of Trees (64 leaves) Cache Misses (%)

44 MSN-1: Scoring Time and Cache Misses Scoring time per document per tree (µs) MSN-1 QS Scoring Time QS Cache Misses We can split the tree ensemble 40 in disjoint blocks p r o c e 30 s s e d Number of Trees (64 leaves) Cache Misses (%) s e p a r a t e l y i n order to 20 let the c o r r e s p o n d i n g data structures fit 10 into the faster l e v e l s o f t h e 0 memory hierarchy.

45 MSN-1: Scoring Time and Cache Misses Scoring time per document per tree (µs) QS Scoring Time BWQS Scoring Time QS Cache Misses BWQS Cache Misses The block-wise version outperforms QS up to 55% 10 (MSN, 20K trees, 64 leaves) Cache Misses (%) Number of Trees (64 leaves)

46 Thank you!

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore