Factor Modeling for Advertisement Targeting

Similar documents

Large-Scale Behavioral Targeting

Collaborative topic models: motivations cont

Generative Clustering, Topic Modeling, & Bayesian Inference

Sparse Stochastic Inference for Latent Dirichlet Allocation

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Collaborative Topic Modeling for Recommending Scientific Articles

A Novel Click Model and Its Applications to Online Advertising

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Scaling Neighbourhood Methods

Behavioral Data Mining. Lecture 2

Graphical Models for Collaborative Filtering

Online Passive-Aggressive Algorithms. Tirgul 11

CS Lecture 18. Topic Models and LDA

Probabilistic Latent Semantic Analysis

Sequence labeling. Taking collectively a set of interrelated instances x_1, ..., x_T and jointly labeling them

Topic Models and Applications to Short Documents

Large-Scale Social Network Data Mining with Multi-View Information. Hao Wang

Ad Placement Strategies

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Mixed Membership Matrix Factorization

Large-scale Collaborative Ranking in Near-Linear Time

Streaming Variational Bayes

Study Notes on the Latent Dirichlet Allocation

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Mixed Membership Matrix Factorization

Probabilistic Matrix Factorization

Data Mining Techniques

Introduction to Logistic Regression

Content-based Recommendation

Introduction to Machine Learning Midterm, Tues April 8

Circle-based Recommendation in Online Social Networks

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Recommendation Systems

Latent Dirichlet Allocation (LDA)

Recommendation Systems

Nonnegative Matrix Factorization

Regionalized and Personalized Auto-completion and Auto-suggestion Techniques in Search. Ye Xugang

Counterfactual Model for Learning Systems

Deep Learning Basics Lecture 10: Neural Language Models. Princeton University COS 495 Instructor: Yingyu Liang

Modeling User Rating Profiles For Collaborative Filtering

Recurrent Latent Variable Networks for Session-Based Recommendation

ECS289: Scalable Machine Learning

Models, Data, Learning Problems

Dirichlet Enhanced Latent Semantic Analysis

Hidden Markov Models

Lab 12: Structured Prediction

Learning from the Wisdom of Crowds by Minimax Entropy. Denny Zhou, John Platt, Sumit Basu and Yi Mao Microsoft Research, Redmond, WA

Restricted Boltzmann Machines for Collaborative Filtering

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

arXiv: v3 [cs.LG] 18 Mar 2013

Recommendation Systems

Probabilistic Language Modeling

Predicting New Search-Query Cluster Volume

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

AN INTRODUCTION TO TOPIC MODELS

Lecture 21: Spectral Learning for Graphical Models

Intelligent Systems (AI-2)

Dynamic Poisson Factorization

Document and Topic Models: plsa and LDA

Latent Semantic Analysis. Hongning Wang

ECS289: Scalable Machine Learning

A Structural Model of Sponsored Search Advertising Auctions

Algorithms for Collaborative Filtering

Gaussian Models

COUNTERFACTUAL REASONING AND LEARNING SYSTEMS LÉON BOTTOU

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Introduction to Machine Learning CMU-10701

Modeling of Growing Networks with Directional Attachment and Communities

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Distributed ML for DOSNs: giving power back to users

1 [15 points] Frequent Itemsets Generation With Map-Reduce

Unsupervised Learning


Introduction to Graphical Models

CS145: INTRODUCTION TO DATA MINING

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 22 Exploratory Text Analysis & Topic Models

Hidden Markov Models

The Expectation-Maximization Algorithm

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Decision Support Systems MEIC - Alameda 2010/2011. Homework #8. Due date: 5.Dec.2011

Estimation of the Click Volume by Large Scale Regression Analysis. May 15

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Sequential Recommender Systems

Counterfactual Evaluation and Learning

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Click Prediction and Preference Ranking of RSS Feeds

Intelligent Systems (AI-2)

Expectation Maximization (EM)

Learning MN Parameters with Approximation. Sargur Srihari

CS60021: Scalable Data Mining. Large Scale Machine Learning

Probabilistic Latent Semantic Analysis

ECS289: Scalable Machine Learning

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Transcription:

Ye Chen 1, Michael Kapralov 2, Dmitry Pavlov 3, John F. Canny 4. 1 eBay Inc, 2 Stanford University, 3 Yandex Labs, 4 UC Berkeley. NIPS 2009. Presented by Miao Liu, May 27, 2010

Outline: Introduction, GaP model, Sponsored search, Behavioral targeting

Introduction: Online Advertising. Ad targeting: select the most relevant ads to present to a user, based on contextual and prior information about this user. The training data: a user-feature matrix of event counts. Events: queries, ad clicks, and ad views. Goal: discover statistical structures (factors) latent in the data, reduce dimensionality, and thus generalize to unseen examples.

Introduction: Click-Through Rate (CTR). CTR is the relevance measure for ad targeting: the click-through rate $c_{i,j}$ of advertisement $a_{i,j}$ is the probability that a user clicks on $a_{i,j}$ given that the advertisement was displayed to the user for query phrase $Q_j$. CTR determines the ad's ranking, placement, pricing, and filtering. Problems: (1) real data has positional bias; (2) learning from click-through data is large scale (terabytes). Two forms of CTR: (1) Sponsored search (SS): $p(\mathrm{click} \mid \mathrm{ad}, \mathrm{user}, \mathrm{query})$; (2) Behavioral targeting (BT): $p(\mathrm{click} \mid \mathrm{ad}, \mathrm{user})$, or $p(\mathrm{click} \mid \mathrm{ad\ category}, \mathrm{user})$.

GaP model. Definitions:
$F \in \mathbb{N}_0^{\,n \times m}$: data matrix; $F_{i,j}$ is the observed count of event (or feature) $i \in \{1,\dots,n\}$ for user $j \in \{1,\dots,m\}$.
$Y \in (\{0\} \cup \mathbb{R}_+)^{\,n \times m}$: matrix of expected counts.
$X \in (\{0\} \cup \mathbb{R}_+)^{\,d \times m}$: factor score matrix; $X_j$ represents user $j \in \{1,\dots,m\}$ in a low-dimensional latent space of topics.
$\Lambda \in [0,1]^{\,n \times d}$: factor loading matrix; $\Lambda_k$ is the $k$th factor (topic), $k \in \{1,\dots,d\}$.
Therefore, we have $Y = \Lambda X$.
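
To make the generative reading of $Y = \Lambda X$ concrete, here is a minimal NumPy sketch of one user's data under GaP (Gamma-Poisson): topic scores are drawn from a Gamma prior and event counts are Poisson with rates $\Lambda x$. The function name and argument layout are illustrative, not from the paper.

```python
import numpy as np

def sample_user(Lam, alpha, beta, rng=None):
    """Draw one user's event counts under the GaP generative model.
    Lam: (n, d) factor loadings; alpha, beta: length-d Gamma shape/scale."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.gamma(shape=alpha, scale=beta)   # latent topic scores, length d
    f = rng.poisson(Lam @ x)                 # observed event counts, length n
    return f, x
```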

GaP model. [Figure: graphical representation of the GaP model.]

GaP model: Inference. Given a corpus of user data $F = (f_1, \dots, f_j, \dots, f_m)$, we wish to find the maximum likelihood estimates (MLE) of the model parameters $(\Lambda, X)$. An EM algorithm in the style of multiplicative nonnegative matrix factorization (NMF) [NIPS02] was developed for the GaP model:
E-step: $x_{kj}^{\,n+1} = x_{kj}^{\,n}\,\dfrac{\sum_i (f_{ij}\Lambda_{ik}/y_{ij}) + (\alpha_k - 1)/x_{kj}}{\sum_i \Lambda_{ik} + 1/\beta_k}$
M-step: $\Lambda_{ik}^{\,n+1} = \Lambda_{ik}^{\,n}\,\dfrac{\sum_j f_{ij}\,x_{kj}/y_{ij}}{\sum_j x_{kj}}$
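
A compact NumPy sketch of these multiplicative updates, assuming scalar Gamma hyperparameters broadcast across topics (the defaults reuse the $\alpha = 1.45$, $\beta = 0.2$ reported in the experiments); variable names are mine, and convergence checks are omitted.

```python
import numpy as np

def gap_em(F, d, alpha=1.45, beta=0.2, n_iter=50, eps=1e-12):
    """EM for GaP: F is the (n, m) count matrix, d the latent dimension."""
    n, m = F.shape
    rng = np.random.default_rng(0)
    Lam = rng.random((n, d))                      # factor loadings
    X = rng.random((d, m))                        # factor scores
    for _ in range(n_iter):
        # E-step: multiplicative update of the user scores X
        Y = Lam @ X + eps                         # current expected counts
        num = X * (Lam.T @ (F / Y)) + (alpha - 1.0)
        den = Lam.sum(axis=0)[:, None] + 1.0 / beta
        X = np.maximum(num / den, eps)
        # M-step: multiplicative update of the loadings Lambda
        Y = Lam @ X + eps
        Lam *= ((F / Y) @ X.T) / (X.sum(axis=1)[None, :] + eps)
    return Lam, X
```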

GaP model: Two variants for CTR prediction.
Empirical estimator: $\widehat{\mathrm{CTR}}_{\mathrm{ad}(i)j} = (\Lambda_{\mathrm{click}(i)} X_j + \delta)/(\Lambda_{\mathrm{view}(i)} X_j + \eta)$, where $\mathrm{click}(i)$ and $\mathrm{view}(i)$ index the click/view feature pair of ad $i$ for user $j$; $\delta$ and $\eta$ are smoothing constants.
Considering the relative frequency of counts: $F \approx Y = V \circ Z = V \circ (\Lambda X)$, where $V$ is the matrix of observed views, $\circ$ is the elementwise product, and $Z = \Lambda X$ is the matrix of click probabilities, so $Z$ directly estimates CTR.
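
A one-function sketch of the empirical estimator above: `click_idx` and `view_idx` are the row indices of the click and view features for one ad, and the smoothing values $\delta$, $\eta$ are placeholders (the slides do not fix them).

```python
def empirical_ctr(Lam, X, click_idx, view_idx, delta=0.5, eta=10.0):
    """Smoothed CTR estimates for one ad across all users:
    (Lambda_click @ X + delta) / (Lambda_view @ X + eta)."""
    exp_clicks = Lam[click_idx] @ X   # expected clicks per user, length m
    exp_views = Lam[view_idx] @ X     # expected views per user, length m
    return (exp_clicks + delta) / (exp_views + eta)
```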

GaP model: Rationale for the GaP model. GaP is a generative probabilistic model for discrete data (such as text). Similarity to LDA: GaP represents each sample (document or user) as a mixture of topics. Key differences between LDA and GaP: in LDA the latent factor is chosen independently word-by-word, while in GaP several items are drawn from each latent factor; computation for GaP is much simpler. The latent factors in GaP are restricted to be non-negative because (1) negative factors have no clear interpretation, (2) non-negativity is required for click-through counts, and (3) the nonzeros are much more stable and the cross-validation error much lower.

Sponsored search: The GaP deployment for SS.
Offline training: given the observed click counts $F$ and view counts $V$, derive $\Lambda$ and $X$ using the CTR-based GaP algorithm.
Offline user profile updating: given $\Lambda$, update the user profiles $X$ in a distributed and data-local fashion.
Online CTR prediction: given the global $\Lambda$ and the locally learned $X$, the CTR for a user given a QL (query-linead), i.e. $p(\mathrm{click} \mid \mathrm{QL}, \mathrm{user})$, is computed as $Z^{\mathrm{mat}}_j = \Lambda^{\mathrm{mat}} X_j$.
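
The data-local profile update can be read directly off the E-step: with the global $\Lambda$ held fixed, each user's profile is refreshed from that user's own counts alone, which is what makes the distributed deployment possible. A hedged sketch (scalar hyperparameters, illustrative names):

```python
import numpy as np

def update_profile(f_j, Lam, alpha=1.45, beta=0.2, n_iter=20, eps=1e-12):
    """Refresh one user's profile x_j with Lambda held fixed (E-step only)."""
    d = Lam.shape[1]
    x = np.ones(d)
    col_sums = Lam.sum(axis=0)            # sum_i Lambda_ik, fixed per topic
    for _ in range(n_iter):
        y = Lam @ x + eps                 # expected counts for this user
        x = np.maximum(
            (x * (Lam.T @ (f_j / y)) + (alpha - 1.0)) / (col_sums + 1.0 / beta),
            eps)
    return x
```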

Sponsored search: GaP online prediction. [Figure: online prediction pipeline.]

Sponsored search: Positional normalization. The position of an ad has a significant effect on its CTR. To achieve positional normalization, assume the following Markov chain: (1) viewing an ad given its position; (2) clicking the ad given that the user actually views it. Then
$p(\mathrm{click} \mid \mathrm{position}) = p(\mathrm{click} \mid \mathrm{view})\, p(\mathrm{view} \mid \mathrm{position})$.
$p(\mathrm{click} \mid \mathrm{position})$ is estimated from the data matrices $F$ and $V$ (feature-by-position); $p(\mathrm{view} \mid \mathrm{position})$ is estimated by applying GaP factorization with one inner dimension to the view-position matrix $V$. From these, estimate $p(\mathrm{click} \mid \mathrm{view})$, normalize the user data matrix, and learn the linear predictor $Z = \Lambda X$; a sketch of the rank-1 step follows.
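
A GaP/NMF factorization with a single inner dimension reduces to multiplicative updates on two vectors. The sketch below (my construction, not the paper's code) estimates a per-position view propensity, which can then be divided out of the raw counts to normalize them.

```python
import numpy as np

def view_position_prior(V_pos, n_iter=100, eps=1e-12):
    """Rank-1 factorization of the feature-by-position view matrix V_pos,
    giving a relative p(view | position) for each position."""
    n, p = V_pos.shape
    u = np.ones(n)                                # per-feature intensity
    v = np.ones(p)                                # per-position propensity
    for _ in range(n_iter):
        Y = np.outer(u, v) + eps                  # current rank-1 fit
        u *= (V_pos / Y) @ v / (v.sum() + eps)    # update feature intensities
        Y = np.outer(u, v) + eps
        v *= (V_pos / Y).T @ u / (u.sum() + eps)  # update position propensities
    return v / v.max()                            # top position normalized to 1
```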

Sponsored search: Large-scale implementation. Key concerns: data locality, data sparsity, and scalability.

Sponsored search: Experiments. Features: query-linead (QL), query term (QT), and linead term (LT). Dataset: 32 buckets of users covering one month (first three weeks for training, last week for testing). Feature selection: frequency > 30, yielding about 1M features (700K QLs, 175K QTs, and 135K LTs) and 1.6M users. Latent dimension $d = 10$, shape parameter $\alpha = 1.45$, and scale parameter $\beta = 0.2$.

Sponsored search GaP Panama QL-CTR 0.82 0.72 0.80 Table: Areas under ROC curves View recall 1% 1-5%avg. 5% 5-20% avg. CTR lift 0.96 0.86 0.93 0.68 Table: CTR lift of Gap over Panama

Behavioral targeting: Experiments. The counts were aggregated over a two-week period from 32 buckets of users, resulting in 170K features (120K ad clicks or views and 50K pages) and 40M users; latent space dimension $d = 20$; 13 EM iterations; ROC areas of 95%. Per-ad prediction with $d = 10$ to $100$ gives ROC areas all in the range [95%, 96%]. Running time scales sub-linearly with the number of user buckets:

Table: Run-time vs. number of user buckets
  Number of buckets   32     64     128    512
  Run-time (hours)    11.2   18.6   31.7   79.8