Introduction of Recruit



1 Apr. 11, 2018

2 Introduction of Recruit We provide a wide range of online services around the world, from job search to hotel reservations. Service areas: Housing, Beauty, Travel, Life & Local, O2O, Education, Automobile, Bridal & Baby, Human Resources, IT & Trends, Media, Dining.

3 Introduction of Recruit We help users find the best clients through our services. Data science plays an important role in this business. [Figure: Internet users and clients]


4 Data Science at Recruit Recruit has hosted two data mining competitions on Kaggle: Recruit Restaurant Visitor Forecasting (2018) and Coupon Purchase Prediction (2015). Kaggle and KDD Cup are international data mining competitions. We are passionate about data science: some of us came in 1st and 2nd place in KDD Cup 2015. [Figure: engineers at Recruit, as of March 2018] © Recruit Communications Co., Ltd.


5 Feature Selection: A Key Technique Feature selection is a key technique for winning data mining competitions: find the most relevant features and balance the bias-variance trade-off. Benefits: improved prediction and reduced computational cost. [Figure: data matrix with User 1 to User n as rows and features as columns] Reference: Beating Kaggle the easy way, studien/2015/dong_ying.pdf

6 Types of Feature Selection (FS) Algorithms Wrapper methods: iteratively evaluate feature subsets with a black-box learning algorithm. Embedded methods: train a model and select features at the same time. Filter methods: select features by a criterion such as Mutual Information; they are independent of the learning algorithm and can be used as a pre-processing step.

7 What is Mutual Information (MI)? Mutual Information I(X;Y) is a measure of the mutual dependence between two random variables X and Y. In terms of Shannon entropy H, I(X;Y) = H(Y) - H(Y|X). High I(X;Y): Y can be predicted well given X. Low I(X;Y): Y is hard to predict given X. Unlike Pearson's correlation coefficient, MI can capture non-linear relationships. [Figure: three scatter plots labeled Pearson r = 0.8, MI = 0.5; Pearson r = 0.0, MI = 0.7; Pearson r = 0.0, MI = (value missing); figure source missing]
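The contrast between Pearson's r and MI can be illustrated with a small sketch. It is not taken from the slides: the toy data (y = x^2 on integer x) and the helper names `pearson` and `mutual_information` are assumptions made for illustration, and the MI estimate is the simple plug-in estimate for discrete samples.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Toy data (assumed): y depends on x deterministically but not linearly.
xs = list(range(-5, 6))
ys = [x * x for x in xs]

r = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits")
```

On this symmetric parabola the linear correlation vanishes while the MI stays well above zero, which is the behaviour the slide describes.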

8 Mutual Information based Feature Selection (MIFS) MIFS uses Mutual Information as the criterion in a filter method. General formulation: MIFS selects the feature subset S of size k that maximizes the Mutual Information (MI) between the selected features X_S and the target variable Y, i.e. $S^* = \arg\max_{S:\ |S| = k} I(X_S; Y)$.

9 Heuristic MIFS Algorithms Max Relevance method: iteratively select the most relevant feature, repeating k times. Min Redundancy & Max Relevance method (MRMR) [1]: iteratively select the most relevant and least redundant feature, repeating k times. [1] H. Peng et al., 2005 [2] J. R. Vergara & P. A. Estévez, 2015
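The two greedy heuristics can be sketched as follows. This is a minimal illustration, not the slide's exact formulas (which were lost in extraction): it assumes precomputed relevance values I(X_i;Y) and pairwise redundancy values I(X_i;X_j), and it uses the commonly stated MRMR score "relevance minus mean redundancy to the already selected features". The function names and toy numbers are hypothetical.

```python
def max_relevance(relevance, k):
    """Pick the k features with the highest relevance I(X_i;Y)."""
    order = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    return order[:k]

def mrmr(relevance, redundancy, k):
    """Greedy MRMR: relevance minus mean redundancy to selected features."""
    selected = []
    remaining = set(range(len(relevance)))
    while len(selected) < k and remaining:
        def score(i):
            if not selected:
                return relevance[i]
            mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
            return relevance[i] - mean_red
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy values (assumed): features 0 and 1 are both relevant but near-duplicates.
relevance = [0.9, 0.85, 0.5]
redundancy = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]

print(max_relevance(relevance, 2))     # picks the two most relevant features
print(mrmr(relevance, redundancy, 2))  # avoids the redundant pair
```

In this toy case Max Relevance keeps both near-duplicate features, while MRMR swaps the second one for a less relevant but less redundant feature.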


10 Our Contributions (1) We reformulate MIFS as a QUBO (Quadratic Unconstrained Binary Optimization) problem. (2) We confirmed that optimization by D-Wave does well in MIFS. [Figure: MI increase (%) w.r.t. Linear versus #features, comparing MIFS optimization methods; higher is better; image source missing] How?

11 Reformulation of MIFS by QUBO (1) Expand the MI term. Proof. Theorem 1.1: chain rule for Conditional Mutual Information. Using Theorem 1.1, an expansion of the MI term holds for every i ∈ S; averaging that expansion over all i ∈ S gives the expanded form.
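The equations of this step did not survive extraction. A plausible reconstruction, assuming the standard chain rule for mutual information, with X_S the selected features, Y the target, and |S| = k (this notation is assumed here, not taken from the slide):

```latex
% Chain rule, valid for every i \in S:
I(X_S; Y) = I(X_i; Y) + I(X_{S \setminus \{i\}}; Y \mid X_i)

% Averaging over all i \in S (|S| = k):
I(X_S; Y) = \frac{1}{k} \sum_{i \in S}
  \left[ I(X_i; Y) + I(X_{S \setminus \{i\}}; Y \mid X_i) \right]
```

The second line follows by summing the first over the k choices of i and dividing by k, since the left-hand side does not depend on i.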

12 Reformulation of MIFS by QUBO (2) Approximate the expanded MI term under the assumption of Conditional Independence (CI). Proof. If we assume conditional independence among the features, the conditional MI term decomposes into a sum of pairwise terms.
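The approximation itself is missing from the extracted text. One form it may take, assuming that the features other than X_i are conditionally independent of each other given X_i and also given (X_i, Y); the notation X_S, Y, k = |S| is assumed here:

```latex
% Conditional independence lets the joint conditional MI split into pairwise terms:
I(X_{S \setminus \{i\}}; Y \mid X_i) \approx \sum_{j \in S,\ j \neq i} I(X_j; Y \mid X_i)

% Substituting into the averaged chain-rule expansion:
I(X_S; Y) \approx \frac{1}{k} \sum_{i \in S}
  \Big[ I(X_i; Y) + \sum_{j \in S,\ j \neq i} I(X_j; Y \mid X_i) \Big]
```

Under these assumptions the objective contains only single-feature and pairwise terms, which is what makes a quadratic binary formulation possible.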

13 Reformulation of MIFS by QUBO (3) Optimization of MIFS. The QUBO formulation of MIFS consists of two terms: the MI term, and a penalty term for selecting only k features, where α is the penalty strength.
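A minimal sketch of such a QUBO, under stated assumptions: the MI term is taken as the sum of I(X_i;Y) x_i plus the sum over i != j of I(X_j;Y|X_i) x_i x_j (a constant 1/k factor is ignored), and the penalty is α (Σ x_i - k)^2 with its constant α k^2 dropped. The exact form on the slide is not preserved, the toy numbers are invented, and brute force stands in for the annealer.

```python
import itertools

def build_qubo(relevance, conditional, k, alpha):
    """Matrix A such that minimizing x^T A x over binary x solves the sketch."""
    n = len(relevance)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # x_i^2 = x_i, so the linear penalty part -2k*x_i sits on the diagonal.
                A[i][j] = -relevance[i] + alpha * (1 - 2 * k)
            else:
                A[i][j] = -conditional[i][j] + alpha
    return A

def energy(A, x):
    """Quadratic form x^T A x for a binary tuple x."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def solve_brute_force(A):
    """Exhaustive minimization; only feasible for very small n."""
    n = len(A)
    return min(itertools.product((0, 1), repeat=n), key=lambda x: energy(A, x))

# Toy values (assumed): feature 2 is weak alone but complements feature 0.
relevance = [0.5, 0.4, 0.1, 0.05]
conditional = [
    [0.0, 0.05, 0.3, 0.1],
    [0.05, 0.0, 0.1, 0.1],
    [0.3, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0],
]

A = build_qubo(relevance, conditional, k=2, alpha=10.0)
best = solve_brute_force(A)
print(best)
```

With α large relative to the MI values, the minimizer selects exactly k features; here it prefers the complementary pair over the two individually most relevant features.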

14 Interpretation of the Derived Formulation Expanding the derived formulation shows that maximizing it increases relevance and complementarity and reduces redundancy. [Equation with terms labeled Relevance, Redundancy, and Complementarity]
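The expanded equation is not preserved. A standard identity that yields this three-way reading of a pairwise conditional MI term is shown below; whether the slide used exactly this form is an assumption, as are the symbols X_i, X_j, Y.

```latex
I(X_j; Y \mid X_i)
  = \underbrace{I(X_j; Y)}_{\text{relevance}}
  - \underbrace{I(X_i; X_j)}_{\text{redundancy}}
  + \underbrace{I(X_i; X_j \mid Y)}_{\text{complementarity}}
```

The identity follows from expanding I(X_j; Y, X_i) with the chain rule in both orders and equating the two results.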

15 Comparison of Optimization Methods Problem formulation and optimization methods: Binary Quadratic Problem (BQP): Linear Relaxation [1] (Linear), Truncated Power [1,2] (TPower). QUBO: Tabu Search by qbsolv [3], D-Wave 2000Q. [1] H. Venkateswara, et al., 2015 [2] X. T. Yuan & T. Zhang, 2013 [3]

16 Linear Relaxation Method (Linear) Linearize the quadratic term by introducing new variables; one of the optimality conditions then leads to a simplified problem. Since Q_ij ≥ 0, the solution of this problem is given by the k features with the largest column sums of Q. This solution is tightly bounded [1]. Time complexity is O(nk). [1] H. Venkateswara, et al., 2015
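The final selection rule (take the k largest column sums of Q) can be sketched in a few lines. The function name `linear_select` and the toy matrix are assumptions for illustration; Q is taken to be a non-negative square matrix given as nested lists.

```python
def linear_select(Q, k):
    """Return the indices of the k columns of Q with the largest sums."""
    n = len(Q)
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    top = sorted(range(n), key=lambda j: -col_sums[j])[:k]
    return sorted(top)

# Toy non-negative matrix (assumed values).
Q = [
    [0.5, 0.1, 0.3],
    [0.1, 0.4, 0.0],
    [0.3, 0.0, 0.1],
]

print(linear_select(Q, 2))
```

This sketch computes all column sums directly, which is simple but not tuned to match the complexity stated on the slide.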

17 Truncated Power Method (TPower) TPower finds the largest k-sparse eigenvector x of Q; we select the i-th feature if x_i > 0. The eigenvector is computed by an iterative procedure that is repeated T times [1]. This method is confirmed to be the best-performing method for the BQP problem with a non-negative matrix [2]. Time complexity of the algorithm is O(Tn^2). [1] X. T. Yuan & T. Zhang, 2013 [2] H. Venkateswara, et al., 2015
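The iterative procedure is not preserved in the extracted text. The sketch below follows the usual description of the truncated power method (multiply by Q, keep the k largest-magnitude entries, renormalize), which is assumed to match the slide; the uniform starting vector, the default T, and the toy matrix are choices made here.

```python
import math

def tpower(Q, k, T=50):
    """Truncated power iteration; returns indices i with x_i > 0."""
    n = len(Q)
    x = [1.0 / math.sqrt(n)] * n  # uniform start (an assumption)
    for _ in range(T):
        # Power step: y = Q x
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Truncation: keep only the k entries of largest magnitude
        keep = set(sorted(range(n), key=lambda i: -abs(y[i]))[:k])
        y = [y[i] if i in keep else 0.0 for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break
        x = [v / norm for v in y]
    return sorted(i for i in range(n) if x[i] > 0)

# Toy non-negative matrix (assumed): features 0-2 interact strongly.
n = 6
Q = [[1.1 if (i < 3 and j < 3) else 0.1 for j in range(n)] for i in range(n)]

print(tpower(Q, 3))
```

Each iteration costs one matrix-vector product, consistent with the O(Tn^2) complexity quoted on the slide.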


18 Optimization by D-Wave Machine We used the D-Wave machine with the following settings. Machine: D-Wave 2000Q. Embedding: 64-bit full connection. Annealing Time: 20 µs. Annealing Repetitions: 10. When the feature size n is larger than the hardware size h (= 64), we use Linear as a pre-processing step to narrow the candidate features down to h. [Figure: Full Connection Embedding for C(4,4,4)]
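The pre-processing step can be sketched as below. This only illustrates narrowing n candidates down to h with the column-sum rule of the Linear method; it does not call any D-Wave API, and the helper name `narrow_candidates` and the toy matrix are hypothetical.

```python
def narrow_candidates(Q, h):
    """Keep the h features with the largest column sums of Q.

    Returns (kept_indices, sub_matrix), where sub_matrix[a][b] equals
    Q[kept_indices[a]][kept_indices[b]].
    """
    n = len(Q)
    if n <= h:
        return list(range(n)), [row[:] for row in Q]
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    keep = sorted(sorted(range(n), key=lambda j: -col_sums[j])[:h])
    sub = [[Q[i][j] for j in keep] for i in keep]
    return keep, sub

# Toy example (assumed values) with hardware size h = 2 instead of 64.
Q = [
    [0.5, 0.1, 0.3, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.3, 0.0, 0.4, 0.1],
    [0.0, 0.0, 0.1, 0.1],
]
keep, sub = narrow_candidates(Q, 2)
print(keep, sub)
```

The returned index list lets a solution of the reduced problem be mapped back to the original feature numbering.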

19 Comparison of Mutual Information Score We compared the MI scores achieved by each optimization method on a public dataset. The graph shows the increase in MI score with regard to Linear. [Figure: MI increase (%) w.r.t. Linear versus #features; higher is better] Data Name: a1a, #features: 122, #data points: 8000

20 Classification Accuracy We calculated the classification accuracy for different #features, since accuracy is a good measure of the quality of a selected subset of features. Procedure: from the original features, take the selected k features and measure the classification accuracy with random forest classifiers.

21 Classification Accuracy We evaluated each method by classification accuracy for different #features. [Figure: classification accuracy versus #features for D-Wave, TPower, Tabu (qbsolv), and Linear; higher is better] Data Name: a1a, #features: 122, #data points: 8000

22 Summary We derived the QUBO formulation of MIFS so that the problem can be embedded in Ising machines. We used the D-Wave quantum annealing machine as a solver for MIFS. Optimization by D-Wave outperformed TPower, which is the state-of-the-art optimization method for BQP. We are planning to use MIFS by D-Wave in Kaggle!

23 Thank you for listening

24 Runtime of Optimization Methods Average runtime by method: Linear: 9.0 msec. TPower: 26.1 msec. Tabu (qbsolv): 14.3 sec. D-Wave: 9.0 msec (Linear) plus annealing time in μsec (value missing). Data Name: a1a, #features: 122, #data points:

25 Comparison to MRMR, Max Rel. [Figure: accuracy versus #features for D-Wave, MRMR, and Max Rel.] Data Name: a1a, #features: 122, #data points:
