Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Click-Through Rate prediction: TOP-5 solution for the Avazu contest Dmitry Efimov Petrovac, Montenegro June 04, 2015

Outline
- Provided data
- Likelihood features
- FTRL-Proximal Batch algorithm
- Factorization Machines
- Final results

Competition

Provided data (basic features)
- Device layer: id, model, type
- Time layer: day, hour
- Site layer: id, domain, category
- Connection layer: ip, type
- Banner layer: position, C1, C14-C21
- Application layer: id, domain, category

Notations
- $X$: $m \times n$ design matrix, $m_{\text{train}} = 40\,428\,967$, $m_{\text{test}} = 4\,577\,464$, $n = 23$
- $y$: binary target vector of size $m$
- $x_j$: column $j$ of matrix $X$; $x_i$: row $i$ of matrix $X$
- $\sigma(z) = \dfrac{1}{1 + e^{-z}}$: sigmoid function

Evaluation
Logarithmic loss for $y_i \in \{0, 1\}$:
$$L = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right),$$
where $\hat{y}_i$ is the prediction for sample $i$.
Logarithmic loss for $y_i \in \{-1, 1\}$:
$$L = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + e^{-y_i p_i}\right),$$
where $p_i$ is the raw score from the model and $\hat{y}_i = \sigma(p_i)$, $i \in \{1, \ldots, m\}$.
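A minimal NumPy sketch of the two equivalent loss formulations (the toy arrays and function names are illustrative, not part of the contest code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logloss_01(y, y_hat, eps=1e-15):
        # y in {0, 1}, y_hat = predicted probabilities
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def logloss_pm1(y_pm, p):
        # y_pm in {-1, +1}, p = raw scores; equivalent to logloss_01
        return np.mean(np.log1p(np.exp(-y_pm * p)))

    p = np.array([1.2, -0.3, 2.0])      # raw scores
    y01 = np.array([1, 0, 1])           # labels in {0, 1}
    assert np.isclose(logloss_01(y01, sigmoid(p)), logloss_pm1(2 * y01 - 1, p))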

Feature engineering
- Blocks: combination of two or more features
- Counts: number of samples for different feature values
- Counts unique: number of different values of one feature for a fixed value of another feature
- Likelihoods: $\min_{\theta_t} L$, where $\theta_t = P(y_i \mid x_{ij} = t)$
- Others

Feature engineering (algorithm 1)
function SPLITBYROWS(M): partition the matrix $M$ into submatrices $\{M_t\}$ by rows such that each $M_t$ has identical rows.
Require: $J = \{j_1, \ldots, j_s\} \subset \{1, \ldots, n\}$, $K = \{k_1, \ldots, k_q\} \subset \{1, \ldots, n\} \setminus J$
$Z \leftarrow (x_{ij})$, $i \in \{1, 2, \ldots, m\}$, $j \in J$
for each block $I = \{i_1, \ldots, i_r\}$ of SPLITBYROWS($Z$), with block index $t$:
    $c_{i_1} = \ldots = c_{i_r} = r$  (counts)
    $p_{i_1} = \ldots = p_{i_r} = \dfrac{y_{i_1} + \ldots + y_{i_r}}{r}$  (likelihoods)
    $b_{i_1} = \ldots = b_{i_r} = t$  (blocks)
    $A_t \leftarrow (x_{ik})$, $i \in I$, $k \in K$
    $T(A_t) = $ size(SPLITBYROWS($A_t$))
    $u_{i_1} = \ldots = u_{i_r} = T(A_t)$  (counts unique)
[Slide diagram: Steps 1-5 illustrate the matrices $X$, $Z$, $Z_1, \ldots, Z_{T(Z)}$, $A_t$ and $B_1, \ldots, B_{T(A_t)}$; the pictures do not survive transcription.]
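A rough pandas sketch of these four derived features, assuming a DataFrame `df` with the raw columns, a key set `J`, one remaining column `k`, and the click label in column `y` (all names here are illustrative, not the author's code):

    import pandas as pd

    def block_features(df, J, k, y_col="y"):
        """Counts, likelihoods, block ids and counts-unique for key columns J."""
        g = df.groupby(list(J), sort=False)
        out = pd.DataFrame(index=df.index)
        out["count"] = g[y_col].transform("size")        # c: block size
        out["likelihood"] = g[y_col].transform("mean")   # p: mean target in block
        out["block"] = g.ngroup()                        # b: block identifier t
        out["count_unique"] = g[k].transform("nunique")  # u: distinct values of k in block
        return out

    # toy usage
    df = pd.DataFrame({"site_id": ["a", "a", "b"], "app_id": ["x", "x", "z"],
                       "device_id": ["d1", "d2", "d1"], "y": [1, 0, 1]})
    print(block_features(df, ["site_id", "app_id"], "device_id"))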

Feature engineering (algorithm 2)
Uses SPLITBYROWS from algorithm 1.
Require: parameter $\alpha > 0$; increasing sequence $J = (j_1, \ldots, j_s) \subset \{1, \ldots, n\}$; $V = (v_1, \ldots, v_l) \subset \{1, \ldots, s\}$, $v_1 < \ldots < v_l$
$f_i \leftarrow \dfrac{y_1 + \ldots + y_m}{m}$, $i \in \{1, \ldots, m\}$
for $v \in V$ do
    $J_v = \{j_1, \ldots, j_v\}$, $Z = (x_{ij})$, $i \in \{1, 2, \ldots, m\}$, $j \in J_v$
    for each block $I = \{i_1, \ldots, i_r\}$ of SPLITBYROWS($Z$) do
        $c_{i_1} = \ldots = c_{i_r} = r$
        $p_{i_1} = \ldots = p_{i_r} = \dfrac{y_{i_1} + \ldots + y_{i_r}}{r}$
    $w = \sigma(c + \alpha)$: weight vector
    $f_i = (1 - w_i) f_i + w_i p_i$, $i \in \{1, \ldots, m\}$
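One way to read this is as a hierarchical smoothing of likelihood features: start from the global CTR and, for progressively longer key prefixes, blend in the group mean with a confidence weight that depends on the group size. A hedged pandas sketch (the weight formula $\sigma(c + \alpha)$ is taken from the slide as transcribed; the column names, prefix list, and parameter value are illustrative assumptions):

    import numpy as np
    import pandas as pd

    def smoothed_likelihood(df, prefixes, y_col="y", alpha=1.0):
        """Blend group means over growing key prefixes into one feature f."""
        f = np.full(len(df), df[y_col].mean())      # start from the global mean
        for cols in prefixes:                       # e.g. [["site_id"], ["site_id", "app_id"]]
            g = df.groupby(list(cols), sort=False)[y_col]
            c = g.transform("size").to_numpy()      # block sizes
            p = g.transform("mean").to_numpy()      # block likelihoods
            w = 1.0 / (1.0 + np.exp(-(c + alpha)))  # weight vector sigma(c + alpha), as on the slide
            f = (1.0 - w) * f + w * p
        return f

    df = pd.DataFrame({"site_id": ["a", "a", "b", "b"], "app_id": ["x", "z", "x", "x"],
                       "y": [1, 0, 1, 1]})
    df["likelihood_smooth"] = smoothed_likelihood(df, [["site_id"], ["site_id", "app_id"]])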

FTRL-Proximal model
Weight update:
$$w_{i+1} = \arg\min_{w} \left( \sum_{r=1}^{i} g_r \cdot w + \frac{1}{2} \sum_{r=1}^{i} \tau_r \| w - w_r \|_2^2 + \lambda_1 \| w \|_1 \right)
= \arg\min_{w} \left( w \cdot \sum_{r=1}^{i} (g_r - \tau_r w_r) + \frac{1}{2} \| w \|_2^2 \sum_{r=1}^{i} \tau_r + \lambda_1 \| w \|_1 + \text{const} \right),$$
where
$$\sum_{r=1}^{i} \tau_{rj} = \frac{\beta + \sqrt{\sum_{r=1}^{i} g_{rj}^2}}{\alpha} + \lambda_2, \quad j \in \{1, \ldots, N\},$$
$\lambda_1, \lambda_2, \alpha, \beta$ are parameters, $\tau_r = (\tau_{r1}, \ldots, \tau_{rN})$ are per-coordinate learning rates, and $g_r$ is the gradient vector at step $r$.

FTRL-Proximal Batch model
Require: parameters $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$
Initialize $z_j \leftarrow 0$ and $n_j \leftarrow 0$, $j \in \{1, \ldots, N\}$
for $i = 1$ to $m$ do
    receive sample vector $x_i$ and let $J = \{ j \mid x_{ij} \neq 0 \}$
    for $j \in J$ do
        $w_j = 0$ if $|z_j| \leq \lambda_1$, otherwise $w_j = -\left( \dfrac{\beta + \sqrt{n_j}}{\alpha} + \lambda_2 \right)^{-1} \left( z_j - \operatorname{sign}(z_j)\, \lambda_1 \right)$
    predict $\hat{y}_i = \sigma(x_i \cdot w)$ using the $w_j$ and observe $y_i$
    if $y_i \in \{0, 1\}$ then
        for $j \in J$ do
            $g_j = \hat{y}_i - y_i$  (gradient direction of the loss w.r.t. $w_j$)
            $\tau_j = \dfrac{1}{\alpha} \left( \sqrt{n_j + g_j^2} - \sqrt{n_j} \right)$
            $z_j = z_j + g_j - \tau_j w_j$
            $n_j = n_j + g_j^2$
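A compact Python sketch of this per-coordinate loop (the hashed feature indices, parameter values, and data format are assumptions for illustration; the update itself follows the FTRL-Proximal pseudocode above, i.e. McMahan et al.):

    import math
    from collections import defaultdict

    class FTRLProximal:
        def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
            self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
            self.z = defaultdict(float)   # accumulated shifted gradients z_j
            self.n = defaultdict(float)   # accumulated squared gradients n_j

        def _weight(self, j):
            z = self.z[j]
            if abs(z) <= self.l1:
                return 0.0
            denom = (self.beta + math.sqrt(self.n[j])) / self.alpha + self.l2
            return -(z - math.copysign(self.l1, z)) / denom

        def predict(self, features):
            # features: iterable of active (value 1) feature indices
            p = sum(self._weight(j) for j in features)
            return 1.0 / (1.0 + math.exp(-max(min(p, 35.0), -35.0)))

        def update(self, features, y, y_hat):
            for j in features:
                g = y_hat - y                   # gradient for a binary feature
                tau = (math.sqrt(self.n[j] + g * g) - math.sqrt(self.n[j])) / self.alpha
                self.z[j] += g - tau * self._weight(j)
                self.n[j] += g * g

    # usage: one pass over hashed categorical samples
    model = FTRLProximal()
    for features, y in [({"site=a", "app=x"}, 1), ({"site=b", "app=x"}, 0)]:
        y_hat = model.predict(features)
        model.update(features, y, y_hat)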

Performance

    Description                                                           Leaderboard score
    dataset is sorted by app id, site id, banner pos, count1, day, hour   0.3844277
    dataset is sorted by app domain, site domain, count1, day, hour       0.3835289
    dataset is sorted by person, day, hour                                0.3844345
    dataset is sorted by day, hour with 1 iteration                       0.3871982
    dataset is sorted by day, hour with 2 iterations                      0.3880423

Factorization Machine (FM)
Second-order polynomial regression:
$$\hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{jk}\, x_j x_k \right)$$
Low-rank approximation (FM):
$$\hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (v_{j1}, \ldots, v_{jH}) \cdot (v_{k1}, \ldots, v_{kH})\, x_j x_k \right)$$
$H$ is the number of latent factors.

Factorization Machine for categorical dataset
Assign a set of latent factors to each (level, feature) pair:
$$\hat{y}_i = \sigma\left( \frac{2}{n} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (w_{x_{ij}k1}, \ldots, w_{x_{ij}kH}) \cdot (w_{x_{ik}j1}, \ldots, w_{x_{ik}jH}) \right)$$
Add a regularization term:
$$L_{\text{reg}} = L + \frac{1}{2} \lambda \| w \|^2$$
The gradient direction:
$$g_{x_{ij}kh} = \frac{\partial L_{\text{reg}}}{\partial w_{x_{ij}kh}} = -\frac{2}{n} \cdot \frac{y_i e^{-y_i p_i}}{1 + e^{-y_i p_i}}\, w_{x_{ik}jh} + \lambda w_{x_{ij}kh}$$
Learning rate schedule:
$$\tau_{x_{ij}kh} = \tau_{x_{ij}kh} + g_{x_{ij}kh}^2$$
Weight update:
$$w_{x_{ij}kh} = w_{x_{ij}kh} - \frac{\alpha}{\sqrt{\tau_{x_{ij}kh}}}\, g_{x_{ij}kh}$$
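A hedged NumPy sketch of one stochastic update of this field-aware factorization for a single sample with label $y_i \in \{-1, +1\}$ (the dense weight tensor, hashed level space, initialization scale, and parameter values are illustrative assumptions; the gradient, squared-gradient accumulator, and update mirror the formulas above):

    import numpy as np

    rng = np.random.default_rng(0)
    n, H = 23, 4                       # features per sample, latent factors
    levels = 10 ** 4                   # hashed level space (tiny for the sketch)
    lam, alpha0 = 2e-5, 0.05
    W = 0.01 * rng.standard_normal((levels, n, H))   # latent vectors per (level, feature)
    G = np.ones((levels, n, H))                      # accumulated squared gradients

    def raw_score(x):
        # x: length-n array of hashed level indices for one sample
        s = 0.0
        for j in range(n - 1):
            for k in range(j + 1, n):
                s += W[x[j], k] @ W[x[k], j]
        return 2.0 * s / n

    def sgd_step(x, y):                # y in {-1, +1}
        p = raw_score(x)
        kappa = -y * np.exp(-y * p) / (1.0 + np.exp(-y * p))   # dL/dp
        for j in range(n - 1):
            for k in range(j + 1, n):
                gj = (2.0 / n) * kappa * W[x[k], j] + lam * W[x[j], k]
                gk = (2.0 / n) * kappa * W[x[j], k] + lam * W[x[k], j]
                G[x[j], k] += gj ** 2
                G[x[k], j] += gk ** 2
                W[x[j], k] -= alpha0 / np.sqrt(G[x[j], k]) * gj
                W[x[k], j] -= alpha0 / np.sqrt(G[x[k], j]) * gk

    x = rng.integers(0, levels, size=n)
    sgd_step(x, +1)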

Ensembling

    Model   Description                                                                Leaderboard score
    ftrlb1  dataset is sorted by app id, site id, banner pos, count1, day, hour        0.3844277
    ftrlb2  dataset is sorted by app domain, site domain, count1, day, hour            0.3835289
    ftrlb3  dataset is sorted by person, day, hour                                     0.3844345
    fm      factorization machine                                                      0.3818004
    ens     weighted combination: fm (0.6), ftrlb1 (0.1), ftrlb2 (0.2), ftrlb3 (0.1)   0.3810447
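The transcription does not preserve whether these weights enter arithmetically or as exponents; a weighted geometric mean of the predicted probabilities is one plausible reading, sketched here with illustrative toy arrays:

    import numpy as np

    def geometric_blend(preds, weights, eps=1e-15):
        """Weighted geometric mean of probability vectors."""
        preds = np.clip(np.asarray(preds), eps, 1 - eps)
        return np.exp(np.average(np.log(preds), axis=0, weights=weights))

    fm, ftrlb1, ftrlb2, ftrlb3 = (np.array([0.02, 0.15]) for _ in range(4))  # toy predictions
    ens = geometric_blend([fm, ftrlb1, ftrlb2, ftrlb3], weights=[0.6, 0.1, 0.2, 0.1])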

Final results

    Place  Team                 Leaderboard score  Difference from the 1st place
    1      4 Idiots             0.3791384
    2      Owen                 0.3803652          0.32%
    3      Random Walker        0.3806351          0.40%
    4      Julian de Wit        0.3810307          0.50%
    5      Dmitry Efimov        0.3810447          0.50%
    6      Marios and Abhishek  0.3828641          0.98%
    7      Jose A. Guerrero     0.3829448          1.00%

Future work
- apply the batching idea to the Factorization Machine algorithm
- find a better sorting for the FTRL-Proximal Batch algorithm
- find an algorithm that can identify a better sorting without a cross-validation procedure

References
- H. Brendan McMahan et al. Ad click prediction: a view from the trenches. In KDD, Chicago, Illinois, USA, August 2013.
- Wei-Sheng Chin et al. A learning-rate schedule for stochastic gradient methods to matrix factorization. In PAKDD, 2015.
- Michael Jahrer et al. Ensemble of collaborative filtering and feature engineered models for click through rate prediction. In KDD Cup, 2012.
- Steffen Rendle. Social network and click-through prediction with factorization machines. In KDD Cup, 2012.

Thank you! Questions??? Dmitry Efimov defimov@aus.edu