Introduction to Computational Advertising

Size: px

Start display at page:

Download "Introduction to Computational Advertising"

Edwina Hampton
6 years ago
Views:

1 Introduction to Computational Advertising MS&E 9 Stanford University Autumn Instructors: Dr. Andrei Broder and Dr. Vanja Josifovski Yahoo! Research

2 General course info Course Website: Instructors Dr. Andrei Broder, Yahoo! Research, Dr. Vanja Josifovski, Yahoo! Research, TA: Krishnamurthy Iyer Office hours: Tuesdays 6:pm-7:pm, Huang Course lists Staff:msande9-aut-staff All: msande9-aut-students Please use the staff list to communicate with the staff Lectures: am ~ :pm Fridays in HP Office Hours: After class and by appointment Andrei and Vanja will be on campus for times each to meet and discuss with students. Feel free to come and chat about even issues that go beyond the class.

3 Course Overview (subject to change). 9/ Overview and Introduction. /7 Marketplace and Economics. / Textual Advertising : Sponsored Search. / Textual Advertising : Contextual Advertising. /8 Display Advertising 6. / Display Advertising 7. / Targeting 8. /8 Recommender Systems 9. / Mobile, Video and other Emerging Formats. /9 Project Presentations

4 Lecture 8: Recommender Systems for Display Advertising Targeting

5 Disclaimers This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo! inc or any other entity. Algorithms, techniques, features, etc mentioned here might or might not be in use by Yahoo! or any other company. This lecture is largely based on slides by Yehuda Koren. His help very much appreciated! Other contributors: Amr Ahmed and Alex Smola

6 Lecture 8 plan Problem definition Neighborhood based approaches Matrix factorization approaches Generative models for factorization: Latent Dirichlet Allocation External information in campaign optimization (time permitting) 6

7 Checkpoint - targeting Targeting is a key step in differentiation of impressions and extracting value! Traditional targeting: demo, geo, BT How to get the data from the user? Infer the data from historical activity One of the key step in targeting is user profile generation Generative models to assign probability of a sequence of events Weighting based on time, event type and content Predict the counts of events in certain categories Clustering and other unsupervised techniques useful more to come in the next lecture 7

8 Recommender systems What is actually recommender system technology? A set of techniques to recommend items based on explicit (rating) or implicit (page visits, ad clicks) It collects the user responses and assumes the items are opaque Usually does not take in account the content of the item Opposed to matching of ads we have discussed so far Recent techniques combine the two Shown to be effective in many domains, including advertising Slides mostly on movie ratings, we will discuss the similarities/differences wit the ad domain 8

9 Collaborative filtering Recommend items based on past transactions of many users Analyze relations between users and/or items Specific data characteristics are irrelevant Domain-free: user/item attributes are not necessary Can identify elusive aspects

11 score date movie user /7/ 8// /6/ // 7// 768 // 76 8// 9// 68 // /8/ 8// // 6 6 score date movie user? /6/ 6? 9// 96? 8/8/ 7? //? 6// 7? 8//? 9//? 8/7/ 8? // 9? 7/6/ 7? // 69 6? // 8 6 Training data Test data Movie rating data

12 Alternate view of the data: matrix ? 6 users items

13 Major Challenges. Countless factors may affect preferences Genre, movie/tv series/other Style of action, dialogue, plot, music et al. Director, actors. Large imbalances. Scalability Most user-item preferences are unknown Number of ratings per user or item may vary by several orders of magnitude Information to estimate individual parameters varies widely Some datasets contain millions of users/items

14 Conventions r ui r ˆui - rating by user u to item i - predicted rating by user u to item i Error function: rmse( S) = ( ui, ) S rˆ ( ) ui S r ui

15 How does this map to the ad world Ad matrix a lot sparser As with movies, no info does not mean negative response We could determine negative responses by analysis of user history Ranking metrics might be better option AUC of ROC curve Need to limit to the top-k items We cannot show every ad to every user In practice combine rec sys methods with predictive modeling for best performance

16 Neighborhood methods # # # Joe #

17 Neighborhood-based CF Earliest and most popular collaborative filtering method Derive unknown ratings from those of similar items (item-item variant) A parallel user-user flavor: rely on ratings of like-minded users

18 Neighborhood-based CF users items - unknown rating - rating between to

19 Neighborhood-based CF ? 6 users items - estimate rating of item by user

20 Neighborhood-based CF ? 6 users Neighbor selection: Identify items similar to, rated by user items

21 Neighborhood-based CF ? 6 users Compute similarity weights: s =., s 6 =. items

22 Neighborhood-based CF users Predict by taking a weighted average: (.*+.*)/(.+.)=.6 items

23 Properties of neighborhood-based CF Intuitive Easy to explain reasoning behind a recommendation Handles new ratings/users seamlessly No substantial preprocessing is required (?) Accurate (enough?)

24 Data normalization Need to identify relations and mix ratings across items/ users However: User and item-specific variability masks fundamental relationships Examples: Some items are systematically rated higher Some items were rated by users that tend to rate low Ratings change along time Normalization is critical to the success of a k-nn approach

25 Data normalization Remove data characteristics that are unlikely to be explained by k- NN Common practice is to use centering: Remove user- and item-means A more comprehensive approach eliminates additional interfering variability such as time effects See global Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights, ICDM 7 Here, we normalize by removing the baseline predictors: b ui = µ + b u + b i Global mean User bias Item bias

26 Baseline predictors (biases) Mean rating:.7 stars ( ) The Sixth Sense is. stars above avg (b i ) Joe rates. stars below avg (b u ) è Baseline estimation: Joe will rate The Sixth Sense stars ( + b i + b u )

27 Estimation of biases Try to explain each r ui in the train set as Solve the regularized least squares problem: An alternative: µ + b + b min ( r µ b b) + λ ( b + b ) b* ui u i u i ( ui, ) K u i Error for a training case u Regularization First, estimate item biases by averaging over users that rated the item: b i = u R( i) ( r µ ) ui λ + R() i Then, estimate user biases by averaging residuals over items rated by the user: b u = i R( u) ( r µ b) ui λ + R( u) i i

28 Residual: ratings of similar items by the same user. Define a similarity measure between items: s ij. Use s ij to select neighbors s k (i;u): k items most similar to i, that were rated by u. Estimate unknown rating, r ui, as the weighted average rating that u gave to the neighbors: rˆ ui = b + ui k j S ( i; u) s ( r b ) k ij uj uj j S ( i; u) How to compute item-item similarity? s ij

29 Estimating item-item similarities Common practice rely on Pearson correlation coeff Challenge non-uniform user support of item ratings, each item rated by a distinct set of users User ratings for item i:??????????? User ratings for item j:??????????? Compute correlation over shared support

30 Estimating item-item similarities Empirical Pearson correlation coefficient on shared support of items i and j: ˆ ρ = ij u U( i, j) ( r b )( r b ) ui ui uj uj ( r b ) ( r b ) ui ui uj uj u U( i, j) u U( i, j) U(i,j) contains the users who rated both items i and j Estimates with smaller supports are less reliable Use shrunk correlation coeff as a similarity measure: s ij = λ penalizes small supports: Uij (, ) ˆ ρij Uij (, ) + λ U(i,j) << λ è s ij U(i,j) >>λ è s ij ˆij ρ

31 Improvements to common practice Use transformed similarities as interpolation coeff s E.g., by squaring we emphasize stronger relations: rˆ ui = b + ui k j S ( i; u) s ( r b ) k ij uj uj j S ( i; u) See A. Toscher, M. Jahrer and R. Legenstein, Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems for sigmodial transformations Shrink towards baseline when not enough neighborhood info is available: rˆ ui = b + ui % ' & k j S ( i; u) λ + s ij s ( r b ) ij uj uj k j S ( i; u) s ij " s ij!! # ˆr ui $ b ui j!s k (i;u) ( * )

32 Similarities for binary data Often data is not ratings but binary. Example: ad clicks and conversions This requires other natural similarity measures Notation: m i - #users acting on i m ij - #users acting on both i and j m - overall #users () Jaccard similarity: s Shrink estimates: s ij ij = = m ij m + m m i j ij m ij α + m + m m i j ij

33 Similarities for binary data # () Observed/Expected ratio: Under random sampling, expected i-j intersection : m m (expectation of a hypergeometric distribution) i j m observed mij = expected ( m m / m) i j As usual, need to shrink: s ij = α + m ij ( m m / m) i j

34 A user-user approach Dual to the so-far described item-item approach (with similar derivation) Predict a rating from ratings of similar users on the same item Building stones are user-user similarities s uv : rˆ ui = b + ui k v S ( u; i) λ + s ( r b ) uv vi vi k v S ( u; i) Item-item is commonly considered advantageous over user-user when #items < #users: less item-item relations to store, more stable relations, more reliable estimation Item-item meshes better with new users and explaining rec s In some cases user-user becomes more sensible: When users are the more stable anchor of the system (e.g. items are web articles that quickly expire) When #users < #items s uv

35 Part II: Matrix factorization techniques ~

36 Latent factor models serious The Color Purple Amadeus Braveheart Geared towards females Sense and Sensibility Ocean s Lethal Weapon Geared towards males Dave The Princess Diaries The Lion King escapist Independence Day Gus Dumb and Dumber

37 Basic matrix factorization model items ~ ~ items users users A rank- SVD approximation

38 Estimate unknown ratings as inner-products of factors: items ~ ~ items users A rank- SVD approximation users?

39 Estimate unknown ratings as inner-products of factors: items ~ ~ items users A rank- SVD approximation users?

40 Estimate unknown ratings as inner-products of factors: items ~ ~ items users. A rank- SVD approximation users

41 Matrix factorization model ~ Why can t use standard SVD R = M V? SVD isn t defined when entries are missing Regularization is necessary: Estimate as much signal as possible where there are sufficient data, without over fitting where data are scarce

42 A regularized model Limit the values the factors can take Unlimited values will produce an overfit Reduce the optimization space User factors: Model a user u as a vector p u ~ N k (µ, Σ) Item factors: Model an item i as a vector q i ~ N k (, ) Ratings: Measure agreement between u and i: r ui ~ N(p ut q i, ε ) Simplifying assumptions: µ = =, Σ= = λi

43 Matrix factorization as a cost function Min ( ) ( ) T r p q + λ p + q p*, q* ui u i u i known r ui p u - user-factor of u regularization q i - item-factor of i r ui - rating by u for i Optimize by either stochastic gradient-descent or alternating least squares

44 Stochastic gradient descent optimization Perform till convergence: For each training example r ui : Compute prediction error: e ui = r ui p ut q i Update item factor: q i ß q i +γ(p u e ui -λq i ) Update user factor: p u ß p u +γ(q i e ui -λp u ) Two constants to tune: γ (step size) and λ (regulariza7on) Find values that minimize error on valida&on set See, e.g., Simon Funk, Netflix Update: Try This at Home,

45 R P Q

46 R P Q

47 Matrix factorization with biases T rˆ = µ + b + b + p q ui u i u i Baseline predictors: µ global average b u bias of u b i bias of i è Minimization problem: ( ) ( ) T r µ + b + b + p q + λ p + q + b + b Min ( ) p*, p*, b* ui u i u i u i u i known r ui regularization See, e.g., A. Paterek, Improving regularized singular value decomposition for collaborative filtering, Proc. KDD Cup and Workshop 7

48 Ratings values vs rating occurrences There is information in the fact that user has rated a movie The user chose to see the movie The user chose to rate the movie The choice depends on many factors Can we use this information to improve the factorization?

49 Ratings are not given at random! Distribution of ratings Netflix ratings Yahoo! music ratings Yahoo! survey answers B. Marlin et al., Collaborative Filtering and the Missing at Random Assumption UAI 7

50 A powerful source of information: Characterize users by which items they rated, rather than how they rated è A dense binary representation of the data: users items users items Which items users rate? { }, R r ui ui = { }, B b ui ui =

51 Factoring the binary view Describe both R and B using a factor model: r = µ + b + b + p T q ui u i u i b = p T x ui u i User factors are shared across models Each item i is associated with two factor vectors: q i and x i

52 Factoring the binary view T ui u i i: b = p x X = ( x, x,!, x ) n B = ( b,b,!,b ) u u u un p u = (XX T )! XB u = p XB b x u u uj j j User factor is indirectly defined by item factors: sum of item factors for items rated by u

53 Integrating ratings and binary views So far: Each user u is associated with a factor vector p u Each item i is associated with two factor vectors: q i and x i The pure rating model: r = µ + b + b + q p T ui u i i u The binary view of the user factors: pu bujxj = xj j j rated by u

54 Integrating ratings and binary views So far: Each user u is associated with a factor vector p u Each item i is associated with two factor vectors: q i and x i The pure rating model: r = µ + b + b + q p T ui u i i u The binary view of the user factors: pu bujxj = xj j j rated by u r = µ + b + b + q p + b x T ui u i i u uj j j Factorization Meets the Neighborhood, KDD 8 R. Salakhutdinov and A. Mnih, Probabilistic Matrix Factorization, NIPS 7

55 RMSE Factor models: Error vs. #parameters (Ne=lix) who rated what NMF BiasSVD SVD++ SVD v. SVD v. SVD v..87 Millions of Parameters

56 Temporal dynamics Panta rhei 9

57 Something Happened in Early Netflix ratings by date

58 Are movies getting better with time?

59 Multiple sources of temporal dynamics Item-side effects: Product perception and popularity are constantly changing Seasonal patterns influence items popularity User-side effects: Customers redefine their taste Transient, short-term bias; anchoring Drifting rating scale Change of rater within household

60 Temporal dynamics - challenges Multiple effects: Both items and users are changing over time è Scarce data per target Inter-related targets: Signal needs to be shared among users foundation of collaborative filtering è cannot isolate multiple problems è Common concept drift methodologies won t hold. E.g., underweighting older instances is unappealing

61 Addressing temporal dynamics Factor model conveniently allows separately treating different aspects We observe changes in:. Rating scale of individual users Baseline predictors. Popularity of individual items. User preferences User factors T r ( t) = µ + b () t + b( t) + q p ( t) ui u i i u

62 Parameterizing the model T r ( t) = µ + b () t + b( t) + q p ( t) ui u i i u Use functional forms: b u (t)=f(u,t), b i (t)=g(i,t), p u (t)=h(u,t) Need to find adequate f(), g(), h() General guidelines: Items show slower temporal changes Users exhibit frequent and sudden changes Factors p u (t) are expensive to model Gain flexibility by heavily parameterizing the functions Collaborative Filtering with Temporal Dynamics, KDD 9

63 Factor models: Error vs. #parameters NMF BiasSVD SVD++ RMSE temporal effects SVD v. SVD v. SVD v..87 Millions of Parameters

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org J. Leskovec,