Ye Chen (1), Michael Kapralov (2), Dmitry Pavlov (3), John F. Canny (4)
(1) eBay Inc, (2) Stanford University, (3) Yandex Labs, (4) UC Berkeley
NIPS 2009
Presented by Miao Liu, May 27, 2010
Outline
- Introduction
- GaP model
- Sponsored search
- Behavioral targeting
Introduction: Online Advertising
- Ad targeting: select the most relevant ads to present to a user, based on contextual and prior information about that user.
- Training data: a user-feature matrix of event counts.
- Events: queries, ad clicks, and views.
- Goal: discover the statistical structure (factors) latent in the data, reduce dimensionality, and thereby generalize to unseen examples.
Introduction: Click-Through Rate (CTR)
- CTR is the relevance measure for ad targeting.
- The click-through rate $c_{i,j}$ of advertisement $a_{i,j}$ is the probability that a user clicks on $a_{i,j}$, given that the advertisement was displayed to the user for query phrase $Q_j$.
- CTR determines an ad's ranking, placement, pricing, and filtering.
- Problems:
  1. Real data has positional bias.
  2. Learning from click-through data is large scale (terabytes).
- Two forms of CTR:
  1. Sponsored search (SS): $p(\text{click} \mid \text{ad}, \text{user}, \text{query})$
  2. Behavioral targeting (BT): $p(\text{click} \mid \text{ad}, \text{user})$ or $p(\text{click} \mid \text{ad category}, \text{user})$
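GaP model: GaP (Gamma-Poisson) model
- $F \in \mathbb{N}_0^{n \times m}$: data matrix; $F_{i,j}$ is the observed count of event (or feature) $i \in \{1, \dots, n\}$ by user $j \in \{1, \dots, m\}$.
- $Y \in (\{0\} \cup \mathbb{R}_+)^{n \times m}$: expected counts.
- $X \in (\{0\} \cup \mathbb{R}_+)^{d \times m}$: factor score matrix; column $X_j$ is the low-dimensional representation of user $j \in \{1, \dots, m\}$ in a latent space of topics.
- $\Lambda \in [0, 1]^{n \times d}$: factor loading matrix; column $\Lambda_k$ is the $k$th factor (topic), $k \in \{1, \dots, d\}$.
- Therefore, we have $Y = \Lambda X$.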
GaP model: GaP model [figure]
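GaP model: Inference
- Given a corpus of user data $F = (f_1, \dots, f_j, \dots, f_m)$, we wish to find the maximum likelihood estimates (MLE) of the model parameters $(\Lambda, X)$.
- An EM algorithm for nonnegative matrix factorization (NMF) [NIPS02] was developed for the GaP model. The superscript $(n)$ indexes the EM iteration, $y_{ij} = \sum_k \Lambda_{ik} x_{kj}$ is the expected count, and $\alpha_k$, $\beta_k$ are the shape and scale parameters of the gamma prior on $x_{kj}$.

E-step:
$$x_{kj}^{(n+1)} = x_{kj}^{(n)} \, \frac{\sum_i \left( f_{ij} \Lambda_{ik} / y_{ij} \right) + (\alpha_k - 1) / x_{kj}^{(n)}}{\sum_i \Lambda_{ik} + 1/\beta_k}$$

M-step:
$$\Lambda_{ik}^{(n+1)} = \Lambda_{ik}^{(n)} \, \frac{\sum_j \left( f_{ij} x_{kj} / y_{ij} \right)}{\sum_j x_{kj}}$$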
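The multiplicative E-step and M-step updates can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `expected_counts` and `gap_em_step`, the small constant `eps` that guards divisions, and the choice to use the freshly updated X inside the M-step are all assumptions made for this sketch. The loadings are not rescaled into [0, 1] here.

```python
def expected_counts(Lam, X, eps=1e-12):
    """Y = Lam X (n x m); eps (an assumption) keeps later divisions safe."""
    n, d, m = len(Lam), len(X), len(X[0])
    return [[sum(Lam[i][k] * X[k][j] for k in range(d)) + eps
             for j in range(m)] for i in range(n)]


def gap_em_step(F, Lam, X, alpha, beta, eps=1e-12):
    """One E-step followed by one M-step; returns (new Lam, new X).

    F: n x m counts, Lam: n x d loadings, X: d x m scores,
    alpha / beta: length-d gamma shape / scale parameters.
    """
    n, d, m = len(Lam), len(X), len(X[0])

    # E-step: multiplicative update of the factor scores X, Lam held fixed.
    Y = expected_counts(Lam, X, eps)
    X_new = [[0.0] * m for _ in range(d)]
    for k in range(d):
        denom = sum(Lam[i][k] for i in range(n)) + 1.0 / beta[k]
        for j in range(m):
            ratio = sum(F[i][j] * Lam[i][k] / Y[i][j] for i in range(n))
            # x * (ratio + (alpha - 1) / x) rewritten as x * ratio + (alpha - 1)
            X_new[k][j] = max((X[k][j] * ratio + alpha[k] - 1.0) / denom, eps)

    # M-step: multiplicative update of the loadings Lam, using the new X
    # (an assumption of this sketch).
    Y = expected_counts(Lam, X_new, eps)
    Lam_new = [[0.0] * d for _ in range(n)]
    for k in range(d):
        total = sum(X_new[k]) + eps
        for i in range(n):
            ratio = sum(F[i][j] * X_new[k][j] / Y[i][j] for j in range(m))
            Lam_new[i][k] = Lam[i][k] * ratio / total
    return Lam_new, X_new
```

Calling `gap_em_step` repeatedly from a positive initialization of `Lam` and `X` gives the iterative fit; multiplicative updates keep every entry non-negative as long as the shape parameters are at least 1.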
GaP model: Two variants for CTR prediction
- Empirical estimator:
  $$\widehat{\mathrm{CTR}}_{\mathrm{ad}(i),j} = \frac{\Lambda_{\mathrm{click}(i)} X_j + \delta}{\Lambda_{\mathrm{view}(i)} X_j + \eta}$$
  where click(i) and view(i) correspond to the click/view pair of ad feature $i$ by user $j$, and $\delta$ and $\eta$ are smoothing constants.
- Considering the relative frequency of counts:
  $$F \approx Y = V \odot Z = V \odot (\Lambda X)$$
  where $\odot$ is the element-wise product, $V$ is the matrix of observed views, and $Z = \Lambda X$ is the matrix of click probabilities, so $Z$ directly estimates CTR.
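Both variants reduce to a few dot products. The sketch below is illustrative only: the function names and the default values of the smoothing constants `delta` and `eta` are hypothetical and are not taken from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def empirical_ctr(lam_click, lam_view, x_user, delta=1.0, eta=10.0):
    """Variant 1: smoothed ratio of predicted clicks to predicted views.

    lam_click / lam_view are the rows of Lam for the click and view
    features of one ad; x_user is the user's column of X.
    delta and eta defaults are placeholders, not values from the source.
    """
    return (dot(lam_click, x_user) + delta) / (dot(lam_view, x_user) + eta)


def expected_clicks(V, Lam, X):
    """Variant 2: Y = V (element-wise) Z, with Z = Lam X the predicted CTR."""
    d, m = len(X), len(X[0])
    Z = [[sum(row[k] * X[k][j] for k in range(d)) for j in range(m)]
         for row in Lam]
    return [[V[i][j] * Z[i][j] for j in range(m)] for i in range(len(Lam))]
```

With no evidence for a user (an all-zero `x_user`), the first estimator falls back to `delta / eta`, which is the usual role of such smoothing constants.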
GaP model: Rationale for GaP model
- GaP is a generative probabilistic model for discrete data (such as text).
- Similarity to LDA: GaP represents each sample (document or user) as a mixture of topics.
- Key differences between LDA and GaP:
  - In LDA, the choice of latent factor is independent word by word, while in GaP several items are chosen from each latent factor.
  - Computation for GaP is much simpler.
- The latent factors in GaP are restricted to be non-negative:
  1. Negative factors have no clear interpretation.
  2. Non-negativity is sufficient for click-through counts.
  3. Nonzeros will be much more stable and cross-validation error much lower.
Sponsored search: The GaP deployment for SS
- Offline training: given the observed click counts $F$ and view counts $V$, derive $\Lambda$ and $X$ using the CTR-based GaP algorithm.
- Offline user profile updating: given $\Lambda$, update the user profile $X$ in a distributed and data-local fashion.
- Online CTR prediction: given the global $\Lambda$ and the locally learned $X_j$, compute the CTR for a user given a QL (query-linead), $p(\text{click} \mid \text{QL}, \text{user})$:
  $$Z_j^{\mathrm{mat}} = \Lambda^{\mathrm{mat}} X_j$$
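The online step amounts to one dot product per candidate QL. The following is a rough sketch under assumptions: `candidate_rows` (the row indices of $\Lambda$ for the QLs under consideration) and both function names are hypothetical, and how candidates are selected is outside what the slides specify.

```python
def predict_ctr(Lam, x_user, candidate_rows):
    """Return {row index: predicted CTR} for the candidate QL rows of Lam."""
    return {i: sum(l * x for l, x in zip(Lam[i], x_user))
            for i in candidate_rows}


def rank_candidates(Lam, x_user, candidate_rows):
    """Candidate rows sorted by predicted CTR, highest first."""
    scores = predict_ctr(Lam, x_user, candidate_rows)
    return sorted(scores, key=scores.get, reverse=True)
```

Only the user's own column of X and the relevant rows of the shared loading matrix are needed at serving time, which is what makes the per-user profile update data-local.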
Sponsored search: GaP online prediction [figure]
Sponsored search: Positional normalization
- The position of an ad has a significant effect on CTR.
- To achieve positional normalization, assume the following Markov chain:
  1. viewing an ad given its position;
  2. clicking the ad given that the user actually views it.
  $$p(\text{click} \mid \text{position}) = p(\text{click} \mid \text{view}) \, p(\text{view} \mid \text{position})$$
- $p(\text{click} \mid \text{position})$ is estimated from the data matrices $F$ and $V$ (feature-by-position).
- $p(\text{view} \mid \text{position})$ is estimated by applying the GaP factorization with one inner dimension to the view-position matrix $V$.
- Estimate $p(\text{click} \mid \text{view})$, normalize the user data matrix, and learn the linear predictor $Z = \Lambda X$.
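The factorization of the click probability can be illustrated with a few lines of arithmetic. This sketch assumes the two per-position probabilities have already been estimated; the function names are hypothetical, and reading "normalize" as down-weighting raw impressions by the probability that they were actually seen is one plausible interpretation rather than the confirmed procedure.

```python
def click_given_view(p_click_given_pos, p_view_given_pos):
    """Solve p(click|pos) = p(click|view) * p(view|pos) for p(click|view)."""
    return [c / v if v > 0 else 0.0
            for c, v in zip(p_click_given_pos, p_view_given_pos)]


def effective_views(views_by_position, p_view_given_pos):
    """Impressions weighted by the chance they were seen (assumed reading)."""
    return sum(n * p for n, p in zip(views_by_position, p_view_given_pos))
```

For example, an ad with position-level CTRs of 0.05 and 0.01 at positions seen with probability 0.5 and 0.1 has the same position-free CTR of about 0.1 in both slots.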
Sponsored search: Large-scale implementation
- Data locality
- Data sparsity
- Scalability
Sponsored search: Experiments
- Features: query-linead (QL), query term (QT), and linead term (LT).
- Dataset: 32 buckets of users covering one month (first three weeks for training, last week for testing).
- Feature selection: frequency > 30, yielding more than 1M features (700K QLs, 175K QTs, and 135K LTs); 1.6M users.
- Latent dimension $d = 10$, shape parameter $\alpha = 1.45$, and scale parameter $\beta = 0.2$.
Sponsored search

GaP     Panama   QL-CTR
0.82    0.72     0.80
Table: Areas under ROC curves

View recall   1%     1-5% avg.   5%     5-20% avg.
CTR lift      0.96   0.86        0.93   0.68
Table: CTR lift of GaP over Panama
Behavioral targeting
- The counts were aggregated over a two-week period and from 32 buckets of users, resulting in 170K features (120K ad clicks or views and 50K pages) and 40M users.
- Latent space dimension $d = 20$; 13 EM iterations.
- ROC area of 95%.
- Per-ad prediction with $d = 10$ to $100$: all ROC areas fall in the range [95%, 96%].
- Running time scales sub-linearly with the number of users.

Number of buckets   32     64     128    512
Run-time (hours)    11.2   18.6   31.7   79.8
Table: Run-time vs. number of user buckets
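The sub-linear scaling claim can be checked directly from the run-time table. The sketch below assumes the buckets are of equal size, so that the bucket count is a proxy for the number of users; the function names are hypothetical.

```python
import math

# Values copied from the run-time table.
buckets = [32, 64, 128, 512]
hours = [11.2, 18.6, 31.7, 79.8]


def scaling_exponent(sizes, times):
    """Slope of log(time) against log(size) between first and last points."""
    return math.log(times[-1] / times[0]) / math.log(sizes[-1] / sizes[0])


def hours_per_bucket(sizes, times):
    """Run-time cost per bucket at each scale."""
    return [t / s for s, t in zip(sizes, times)]
```

An exponent below 1 means run time grows more slowly than the data: going from 32 to 512 buckets multiplies the data by 16 but the run time by only about 7, and the cost per bucket falls at every step.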