Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Click-Through Rate prediction: TOP-5 solution for the Avazu contest Dmitry Efimov Petrovac, Montenegro June 04, 2015

Outline
- Provided data
- Likelihood features
- FTRL-Proximal Batch algorithm
- Factorization Machines
- Final results

Competition

Provided data (basic features)
- Device layer: id, model, type
- Time layer: day, hour
- Site layer: id, domain, category
- Connection layer: ip, type
- Banner layer: position, C1, C14-C21
- Application layer: id, domain, category

Notations
- $X$: $m \times n$ design matrix, $m_{\text{train}} = 40\,428\,967$, $m_{\text{test}} = 4\,577\,464$, $n = 23$
- $y$: binary target vector of size $m$
- $x_j$: column $j$ of matrix $X$; $x_i$: row $i$ of matrix $X$
- $\sigma(z) = \dfrac{1}{1 + e^{-z}}$: sigmoid function

Evaluation
Logarithmic loss for $y_i \in \{0, 1\}$:
$$L = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right),$$
where $\hat{y}_i$ is the prediction for sample $i$.
Logarithmic loss for $y_i \in \{-1, 1\}$:
$$L = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + e^{-y_i p_i}\right),$$
where $p_i$ is the raw score from the model and $\hat{y}_i = \sigma(p_i)$, $i \in \{1, \ldots, m\}$.
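A minimal NumPy sketch of the two equivalent loss formulations (the toy arrays and function names are illustrative, not part of the contest code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logloss_01(y, y_hat, eps=1e-15):
        # y in {0, 1}, y_hat = predicted probabilities
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def logloss_pm1(y_pm, p):
        # y_pm in {-1, +1}, p = raw scores; equivalent to logloss_01
        return np.mean(np.log1p(np.exp(-y_pm * p)))

    p = np.array([1.2, -0.3, 2.0])      # raw scores
    y01 = np.array([1, 0, 1])           # labels in {0, 1}
    assert np.isclose(logloss_01(y01, sigmoid(p)), logloss_pm1(2 * y01 - 1, p))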

Feature engineering
- Blocks: combination of two or more features
- Counts: number of samples for different feature values
- Counts unique: number of different values of one feature for a fixed value of another feature
- Likelihoods: $\min_{\theta_t} L$, where $\theta_t = P(y_i \mid x_{ij} = t)$
- Others

Feature engineering (algorithm 1)
function SPLITBYROWS(M): partition the matrix $M$ into submatrices $\{M_t\}$ by rows such that each $M_t$ has identical rows.
Require: $J = \{j_1, \ldots, j_s\} \subset \{1, \ldots, n\}$, $K = \{k_1, \ldots, k_q\} \subset \{1, \ldots, n\} \setminus J$
$Z \leftarrow (x_{ij})$, $i \in \{1, 2, \ldots, m\}$, $j \in J$
for each block $I = \{i_1, \ldots, i_r\}$ of SPLITBYROWS($Z$), with block index $t$:
    $c_{i_1} = \ldots = c_{i_r} = r$  (counts)
    $p_{i_1} = \ldots = p_{i_r} = \dfrac{y_{i_1} + \ldots + y_{i_r}}{r}$  (likelihoods)
    $b_{i_1} = \ldots = b_{i_r} = t$  (blocks)
    $A_t \leftarrow (x_{ik})$, $i \in I$, $k \in K$
    $T(A_t) = $ size(SPLITBYROWS($A_t$))
    $u_{i_1} = \ldots = u_{i_r} = T(A_t)$  (counts unique)
[Slide diagram: Steps 1-5 illustrate the matrices $X$, $Z$, $Z_1, \ldots, Z_{T(Z)}$, $A_t$ and $B_1, \ldots, B_{T(A_t)}$; the pictures do not survive transcription.]
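A rough pandas sketch of these four derived features, assuming a DataFrame `df` with the raw columns, a key set `J`, one remaining column `k`, and the click label in column `y` (all names here are illustrative, not the author's code):

    import pandas as pd

    def block_features(df, J, k, y_col="y"):
        """Counts, likelihoods, block ids and counts-unique for key columns J."""
        g = df.groupby(list(J), sort=False)
        out = pd.DataFrame(index=df.index)
        out["count"] = g[y_col].transform("size")        # c: block size
        out["likelihood"] = g[y_col].transform("mean")   # p: mean target in block
        out["block"] = g.ngroup()                        # b: block identifier t
        out["count_unique"] = g[k].transform("nunique")  # u: distinct values of k in block
        return out

    # toy usage
    df = pd.DataFrame({"site_id": ["a", "a", "b"], "app_id": ["x", "x", "z"],
                       "device_id": ["d1", "d2", "d1"], "y": [1, 0, 1]})
    print(block_features(df, ["site_id", "app_id"], "device_id"))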

Feature engineering (algorithm 2)
Uses SPLITBYROWS from algorithm 1.
Require: parameter $\alpha > 0$; increasing sequence $J = (j_1, \ldots, j_s) \subset \{1, \ldots, n\}$; $V = (v_1, \ldots, v_l) \subset \{1, \ldots, s\}$, $v_1 < \ldots < v_l$
$f_i \leftarrow \dfrac{y_1 + \ldots + y_m}{m}$, $i \in \{1, \ldots, m\}$
for $v \in V$ do
    $J_v = \{j_1, \ldots, j_v\}$, $Z = (x_{ij})$, $i \in \{1, 2, \ldots, m\}$, $j \in J_v$
    for each block $I = \{i_1, \ldots, i_r\}$ of SPLITBYROWS($Z$) do
        $c_{i_1} = \ldots = c_{i_r} = r$
        $p_{i_1} = \ldots = p_{i_r} = \dfrac{y_{i_1} + \ldots + y_{i_r}}{r}$
    $w = \sigma(c + \alpha)$: weight vector
    $f_i = (1 - w_i) f_i + w_i p_i$, $i \in \{1, \ldots, m\}$
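One way to read this is as a hierarchical smoothing of likelihood features: start from the global CTR and, for progressively longer key prefixes, blend in the group mean with a confidence weight that depends on the group size. A hedged pandas sketch (the weight formula $\sigma(c + \alpha)$ is taken from the slide as transcribed; the column names, prefix list, and parameter value are illustrative assumptions):

    import numpy as np
    import pandas as pd

    def smoothed_likelihood(df, prefixes, y_col="y", alpha=1.0):
        """Blend group means over growing key prefixes into one feature f."""
        f = np.full(len(df), df[y_col].mean())      # start from the global mean
        for cols in prefixes:                       # e.g. [["site_id"], ["site_id", "app_id"]]
            g = df.groupby(list(cols), sort=False)[y_col]
            c = g.transform("size").to_numpy()      # block sizes
            p = g.transform("mean").to_numpy()      # block likelihoods
            w = 1.0 / (1.0 + np.exp(-(c + alpha)))  # weight vector sigma(c + alpha), as on the slide
            f = (1.0 - w) * f + w * p
        return f

    df = pd.DataFrame({"site_id": ["a", "a", "b", "b"], "app_id": ["x", "z", "x", "x"],
                       "y": [1, 0, 1, 1]})
    df["likelihood_smooth"] = smoothed_likelihood(df, [["site_id"], ["site_id", "app_id"]])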

FTRL-Proximal model
Weight update:
$$w_{i+1} = \arg\min_{w} \left( \sum_{r=1}^{i} g_r \cdot w + \frac{1}{2} \sum_{r=1}^{i} \tau_r \| w - w_r \|_2^2 + \lambda_1 \| w \|_1 \right)
= \arg\min_{w} \left( w \cdot \sum_{r=1}^{i} (g_r - \tau_r w_r) + \frac{1}{2} \| w \|_2^2 \sum_{r=1}^{i} \tau_r + \lambda_1 \| w \|_1 + \text{const} \right),$$
where
$$\sum_{r=1}^{i} \tau_{rj} = \frac{\beta + \sqrt{\sum_{r=1}^{i} g_{rj}^2}}{\alpha} + \lambda_2, \quad j \in \{1, \ldots, N\},$$
$\lambda_1, \lambda_2, \alpha, \beta$ are parameters, $\tau_r = (\tau_{r1}, \ldots, \tau_{rN})$ are per-coordinate learning rates, and $g_r$ is the gradient vector at step $r$.

FTRL-Proximal Batch model
Require: parameters $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$
Initialize $z_j \leftarrow 0$ and $n_j \leftarrow 0$, $j \in \{1, \ldots, N\}$
for $i = 1$ to $m$ do
    receive sample vector $x_i$ and let $J = \{ j \mid x_{ij} \neq 0 \}$
    for $j \in J$ do
        $w_j = 0$ if $|z_j| \leq \lambda_1$, otherwise $w_j = -\left( \dfrac{\beta + \sqrt{n_j}}{\alpha} + \lambda_2 \right)^{-1} \left( z_j - \operatorname{sign}(z_j)\, \lambda_1 \right)$
    predict $\hat{y}_i = \sigma(x_i \cdot w)$ using the $w_j$ and observe $y_i$
    if $y_i \in \{0, 1\}$ then
        for $j \in J$ do
            $g_j = \hat{y}_i - y_i$  (gradient direction of the loss w.r.t. $w_j$)
            $\tau_j = \dfrac{1}{\alpha} \left( \sqrt{n_j + g_j^2} - \sqrt{n_j} \right)$
            $z_j = z_j + g_j - \tau_j w_j$
            $n_j = n_j + g_j^2$
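A compact Python sketch of this per-coordinate loop (the hashed feature indices, parameter values, and data format are assumptions for illustration; the update itself follows the FTRL-Proximal pseudocode above, i.e. McMahan et al.):

    import math
    from collections import defaultdict

    class FTRLProximal:
        def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
            self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
            self.z = defaultdict(float)   # accumulated shifted gradients z_j
            self.n = defaultdict(float)   # accumulated squared gradients n_j

        def _weight(self, j):
            z = self.z[j]
            if abs(z) <= self.l1:
                return 0.0
            denom = (self.beta + math.sqrt(self.n[j])) / self.alpha + self.l2
            return -(z - math.copysign(self.l1, z)) / denom

        def predict(self, features):
            # features: iterable of active (value 1) feature indices
            p = sum(self._weight(j) for j in features)
            return 1.0 / (1.0 + math.exp(-max(min(p, 35.0), -35.0)))

        def update(self, features, y, y_hat):
            for j in features:
                g = y_hat - y                   # gradient for a binary feature
                tau = (math.sqrt(self.n[j] + g * g) - math.sqrt(self.n[j])) / self.alpha
                self.z[j] += g - tau * self._weight(j)
                self.n[j] += g * g

    # usage: one pass over hashed categorical samples
    model = FTRLProximal()
    for features, y in [({"site=a", "app=x"}, 1), ({"site=b", "app=x"}, 0)]:
        y_hat = model.predict(features)
        model.update(features, y, y_hat)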

Performance

    Description                                                           Leaderboard score
    dataset is sorted by app id, site id, banner pos, count1, day, hour   0.3844277
    dataset is sorted by app domain, site domain, count1, day, hour       0.3835289
    dataset is sorted by person, day, hour                                0.3844345
    dataset is sorted by day, hour with 1 iteration                       0.3871982
    dataset is sorted by day, hour with 2 iterations                      0.3880423

Factorization Machine (FM)
Second-order polynomial regression:
$$\hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{jk}\, x_j x_k \right)$$
Low-rank approximation (FM):
$$\hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (v_{j1}, \ldots, v_{jH}) \cdot (v_{k1}, \ldots, v_{kH})\, x_j x_k \right)$$
$H$ is the number of latent factors.

Factorization Machine for categorical dataset
Assign a set of latent factors to each (level, feature) pair:
$$\hat{y}_i = \sigma\left( \frac{2}{n} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (w_{x_{ij}k1}, \ldots, w_{x_{ij}kH}) \cdot (w_{x_{ik}j1}, \ldots, w_{x_{ik}jH}) \right)$$
Add a regularization term:
$$L_{\text{reg}} = L + \frac{1}{2} \lambda \| w \|^2$$
The gradient direction:
$$g_{x_{ij}kh} = \frac{\partial L_{\text{reg}}}{\partial w_{x_{ij}kh}} = -\frac{2}{n} \cdot \frac{y_i e^{-y_i p_i}}{1 + e^{-y_i p_i}}\, w_{x_{ik}jh} + \lambda w_{x_{ij}kh}$$
Learning rate schedule:
$$\tau_{x_{ij}kh} = \tau_{x_{ij}kh} + g_{x_{ij}kh}^2$$
Weight update:
$$w_{x_{ij}kh} = w_{x_{ij}kh} - \frac{\alpha}{\sqrt{\tau_{x_{ij}kh}}}\, g_{x_{ij}kh}$$
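A hedged NumPy sketch of one stochastic update of this field-aware factorization for a single sample with label $y_i \in \{-1, +1\}$ (the dense weight tensor, hashed level space, initialization scale, and parameter values are illustrative assumptions; the gradient, squared-gradient accumulator, and update mirror the formulas above):

    import numpy as np

    rng = np.random.default_rng(0)
    n, H = 23, 4                       # features per sample, latent factors
    levels = 10 ** 4                   # hashed level space (tiny for the sketch)
    lam, alpha0 = 2e-5, 0.05
    W = 0.01 * rng.standard_normal((levels, n, H))   # latent vectors per (level, feature)
    G = np.ones((levels, n, H))                      # accumulated squared gradients

    def raw_score(x):
        # x: length-n array of hashed level indices for one sample
        s = 0.0
        for j in range(n - 1):
            for k in range(j + 1, n):
                s += W[x[j], k] @ W[x[k], j]
        return 2.0 * s / n

    def sgd_step(x, y):                # y in {-1, +1}
        p = raw_score(x)
        kappa = -y * np.exp(-y * p) / (1.0 + np.exp(-y * p))   # dL/dp
        for j in range(n - 1):
            for k in range(j + 1, n):
                gj = (2.0 / n) * kappa * W[x[k], j] + lam * W[x[j], k]
                gk = (2.0 / n) * kappa * W[x[j], k] + lam * W[x[k], j]
                G[x[j], k] += gj ** 2
                G[x[k], j] += gk ** 2
                W[x[j], k] -= alpha0 / np.sqrt(G[x[j], k]) * gj
                W[x[k], j] -= alpha0 / np.sqrt(G[x[k], j]) * gk

    x = rng.integers(0, levels, size=n)
    sgd_step(x, +1)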

Ensembling

    Model   Description                                                                Leaderboard score
    ftrlb1  dataset is sorted by app id, site id, banner pos, count1, day, hour        0.3844277
    ftrlb2  dataset is sorted by app domain, site domain, count1, day, hour            0.3835289
    ftrlb3  dataset is sorted by person, day, hour                                     0.3844345
    fm      factorization machine                                                      0.3818004
    ens     weighted combination: fm (0.6), ftrlb1 (0.1), ftrlb2 (0.2), ftrlb3 (0.1)   0.3810447
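The transcription does not preserve whether these weights enter arithmetically or as exponents; a weighted geometric mean of the predicted probabilities is one plausible reading, sketched here with illustrative toy arrays:

    import numpy as np

    def geometric_blend(preds, weights, eps=1e-15):
        """Weighted geometric mean of probability vectors."""
        preds = np.clip(np.asarray(preds), eps, 1 - eps)
        return np.exp(np.average(np.log(preds), axis=0, weights=weights))

    fm, ftrlb1, ftrlb2, ftrlb3 = (np.array([0.02, 0.15]) for _ in range(4))  # toy predictions
    ens = geometric_blend([fm, ftrlb1, ftrlb2, ftrlb3], weights=[0.6, 0.1, 0.2, 0.1])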

Final results

    Place  Team                 Leaderboard score  Difference from the 1st place
    1      4 Idiots             0.3791384
    2      Owen                 0.3803652          0.32%
    3      Random Walker        0.3806351          0.40%
    4      Julian de Wit        0.3810307          0.50%
    5      Dmitry Efimov        0.3810447          0.50%
    6      Marios and Abhishek  0.3828641          0.98%
    7      Jose A. Guerrero     0.3829448          1.00%

Future work
- apply the batching idea to the Factorization Machine algorithm
- find a better sorting for the FTRL-Proximal Batch algorithm
- find an algorithm that can identify a better sorting without a cross-validation procedure

References
- H. Brendan McMahan et al. Ad click prediction: a view from the trenches. In KDD, Chicago, Illinois, USA, August 2013.
- Wei-Sheng Chin et al. A learning-rate schedule for stochastic gradient methods to matrix factorization. In PAKDD, 2015.
- Michael Jahrer et al. Ensemble of collaborative filtering and feature engineered models for click through rate prediction. In KDD Cup, 2012.
- Steffen Rendle. Social network and click-through prediction with factorization machines. In KDD Cup, 2012.

Thank you! Questions??? Dmitry Efimov defimov@aus.edu