Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Size: px

Start display at page:

Download "Click-Through Rate prediction: TOP-5 solution for the Avazu contest"

Elfrieda Berry
6 years ago
Views:

1 Click-Through Rate prediction: TOP-5 solution for the Avazu contest Dmitry Efimov Petrovac, Montenegro June 04, 2015

2 Outline Provided data Likelihood features FTRL-Proximal Batch algorithm Factorization Machines Final results

3 Competition

4 Provided data Device layer: id, model, type Time layer: day, hour Site layer: id, domain, category Basic features Connection layer: ip, type Banner layer: position, C1, C14-C21 Application layer: id, domain, category

5 Notations X: m n design matrix m train = m test = n = 23 y: binary target vector of size m x j : column j of matrix X x i : row i of matrix X 1 σ(z) = : sigmoid function 1 + e z

6 Evaluation Logarithmic loss for y i {0, 1}: L = 1 m m (y i log(ŷ i ) + (1 y i ) log(1 ŷ i )) i=1 ŷ i is a prediction for the sample i Logarithmic loss for y i { 1, 1}: L = 1 m m log(1 + e y i p i ) i=1 p i is a raw score from the model ŷ i = σ(p i ), i {1,..., m}

7 Feature engineering Blocks: combination of two or more features Counts: number of samples for different feature values Counts unique: number of different values of one feature for fixed value of another feature Likelihoods: min θt L, where θ t = P(y i x ij = t) Others

8 Feature engineering (algorithm 1) x11 x 1j1 x 1js x1n X : xm1 x mj1 x mjs xmn Step 1 x 1j1 x 1js x i1j1 x i1js Z : x ir x j1 ir js x mj1 x mjs Step 2 x i1j1 x i1js Z1 Zt : Z T (Z ) x ir j1 x ir js c i1 =... = c ir = r Step 3 p i1 =... = p ir = y y i1 ir r b i1 =... = b ir = t x i1k1 x i1kq At : function SPLITBYROWS(M) get partition of the matrix M into {M t } by rows such that each M t has identical rows Require: J = {j 1,..., j s} {1,..., n} K = {k 1,..., k q} {1,..., n}\j Z (x ij ) X, i {1, 2,..., m}, j J for I = {i 1,..., i r } SPLITBYROWS(Z ) c i1 =... = c ir = r p i1 =... = p ir = y i y ir r b i1 =... = b ir = t A t = (x ik ) X, i I, k K T (A t ) = size(splitbyrows(a t ) ) u i1 =... = u ir = T (A t ) x ir x k1 ir kq Step 4 B1... B T (At ) Step 5 u i1 =... = u ir = T (At)

9 Feature engineering (algorithm 2) function SPLITBYROWS(M) get partition of the matrix M into {M t } by rows such that each M t has identical rows Require: parameter α > 0 J (j 1,..., j s ) {1,..., n} increasing sequence V (v 1,..., v l ) {1,..., s}, v 1 < s f i y y m, i {1,..., m} m for v V do J v = {j 1,..., j v }, Z = (x ij ) X, i {1, 2,..., m}, j J v for I = {i 1,..., i r } SPLITBYROWS(Z ) do c i1 =... = c ir = r p i1 =... = p ir = y i y ir r w = σ( c + α) - weight vector f i = (1 w i ) f i + w i p i, i {1,..., m}

10 FTRL-Proximal model Weight updates: ( i w i+1 = arg min g r w + 1 w r=1 2 ( = arg min w w where i r=1 τ rj = ) i τ r w w r λ 1 w 1 = r=1 i (g r τ r w r ) w 2 2 r=1 β + i ( ) 2 grj r=1 α i r=1 + λ 2, j {1,..., N}, ) τ r + λ 1 w 1 + const, λ 1, λ 2, α, β - parameters, τ r = (τ r1,..., τ rn ) - learning rates, g r - gradient vector for the step r

11 FTRL-Proximal Batch model Require: parameters α, β, λ 1, λ 2 z j 0 and n j 0, j {1,..., N} for i = 1 to m do receive sample vector x i and let J = {j x ij 0} for j J do 0 if z j λ ( 1 ) w j = β + 1 nj ( ) + λ 2 zj sign(z j )λ 1 otherwise α predict ŷ i = σ(x i w) using the w j and observe y i if y i {0, 1} then for j J do g j = ŷ i y i - gradient direction of loss w.r.t. w j τ j = 1 ( n j + g j α 2 ) n j z j = z j + g j τ j w j n j = n j + g 2 j

12 Performance Description Leaderboard score dataset is sorted by app id, site id, banner pos, count1, day, hour dataset is sorted by app domain, site domain, count1, day, hour dataset is sorted by person, day, hour dataset is sorted by day, hour with 1 iteration dataset is sorted by day, hour with 2 iterations

13 Factorization Machine (FM) Second-order polynomial regression: n 1 ŷ = σ n w jk x j x k j=1 k=j+1 Low rank approximation (FM): n 1 n ŷ = σ (v j1,..., v jh ) (v k1,..., v kh )x j x k j=1 k=j+1 H is a number of latent factors

14 Factorization Machine for categorical dataset Assign set of latent factors for each pair level-feature: ŷ i = σ 2 n 1 n (w xij k1,..., w xij kh) (w xik j1,..., w xik jh) n j=1 k=j+1 Add regularization term: L reg = L λ w 2 The gradient direction: g xij kh = L reg w xij kh = 2 n y i e y i p i 1 + e y i p i w xik jh + λw xij kh ( ) 2 Learning rate schedule: τ xij kh = τ xij kh + g xij kh Weight update: w xij kh = w xij kh α τ xij kh g xij kh

15 Ensembling Model Description Leaderboard score ftrlb1 dataset is sorted by app id, site id, banner pos, count1, day, hour ftrlb2 dataset is sorted by app domain, site domain, count1, day, hour ftrlb3 dataset is sorted by person, day, hour fm factorization machine ens fm 0.6 ftrlb1 0.1 ftrlb2 0.2 ftrlb

16 Final results Place Team Leaderboard score Difference between the 1st place 1 4 Idiots Owen % 3 Random Walker % 4 Julian de Wit % 5 Dmitry Efimov % 6 Marios and Abhishek % 7 Jose A. Guerrero %

17 Future work apply the batching idea to the Factorization Machine algorithm find a better sorting for the FTRL-Proximal Batch algorithm find an algorithm that can find better sorting without cross-validation procedure

18 References H.Brendan McMahan et al. Ad click prediction: a view from the trenches. In KDD, Chicago, Illinois, USA, August Wei-Sheng Chin et al. A learning-rate schedule for stochastic gradient methods to matrix factorization. In PAKDD, Michael Jahrer et al. Ensemble of collaborative filtering and feature engineered models for click through rate prediction. In KDD Cup, 2012 Steffen Rendle. Social network and click-through prediction with factorization machines. In KDD Cup, 2012

19 Thank you! Questions??? Dmitry Efimov

Scaling Neighbourhood Methods

Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)