Ad Placement Strategies


Case Study 1: Estimating Click Probabilities
Tackling an Unknown Number of Features with Sketching
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox, January 14th, 2014

Ad Placement Strategies
- Companies bid on ad prices. Which ad wins? (many simplifications here)
- Naively:
- But:
- Instead:

Learning Problem for Click Prediction
- Prediction task:
- Features:
- Data:
  - Batch:
  - Online:
- Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
- Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches

Logistic Regression
- Logistic function (or sigmoid): $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Learn P(Y|X) directly:
  - Assume a particular functional form
  - Sigmoid applied to a linear function of the data: $P(Y{=}1 \mid x, w) = \frac{1}{1 + \exp\left(-(w_0 + \sum_i w_i x_i)\right)}$
- Features can be discrete or continuous!
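
To make the functional form above concrete, here is a minimal sketch of sigmoid-based click prediction; the function and variable names (`click_probability`, `w0`, `w`, `x`) are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def click_probability(w0, w, x):
    """P(Y=1 | x, w): sigmoid applied to a linear function of the features."""
    return sigmoid(w0 + np.dot(w, x))

# illustrative values: three features, arbitrary weights
print(click_probability(-1.0, np.array([0.5, -0.2, 1.3]), np.array([1.0, 2.0, 0.5])))
```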

Standard v. Regularized Updates
- Maximum conditional likelihood estimate:
  $w^* = \arg\max_w \ln \left[ \prod_j P(y^{(j)} \mid x^{(j)}, w) \right]$
- Regularized maximum conditional likelihood estimate:
  $w^* = \arg\max_w \ln \left[ \prod_j P(y^{(j)} \mid x^{(j)}, w) \right] - \frac{\lambda}{2} \sum_{i>0} w_i^2$

Stochastic Gradient Ascent for Logistic Regression
- Logistic loss as a stochastic function: $E_x[\ell(w, x)] = E_x\left[\ln P(y \mid x, w)\right] - \frac{\lambda}{2}\|w\|_2^2$
- Batch gradient ascent updates:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \left[ y^{(j)} - P(Y{=}1 \mid x^{(j)}, w^{(t)}) \right] \right\}$
- Stochastic gradient ascent updates (online setting):
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y{=}1 \mid x^{(t)}, w^{(t)}) \right] \right\}$
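
A minimal sketch of the stochastic gradient ascent step above, assuming a dense NumPy weight vector and no separate intercept term; the names (`sga_step`, `eta_t`, `lam`) are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_step(w, x_t, y_t, eta_t, lam):
    """One stochastic gradient ascent step for L2-regularized logistic regression:
    w_i <- w_i + eta_t * ( -lam * w_i + x_i * (y - P(Y=1 | x, w)) )."""
    p = sigmoid(np.dot(w, x_t))                      # P(Y=1 | x^(t), w^(t))
    return w + eta_t * (-lam * w + x_t * (y_t - p))

# toy usage: one update on a single (x, y) pair
w = np.zeros(3)
w = sga_step(w, np.array([1.0, 0.0, 2.0]), y_t=1, eta_t=0.1, lam=0.01)
```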

AdaGrad in Euclidean Space
- For $W = \mathbb{R}^d$, for each feature dimension $i$:
  $w_{t+1,i} \leftarrow w_{t,i} - \eta_{t,i}\, g_{t,i}$, where $\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}$
- That is: $w_{t+1,i} \leftarrow w_{t,i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$
- Each feature dimension has its own learning rate!
  - Adapts with $t$
  - Takes the geometry of the past observations into account
  - The primary role of $\eta$ is determining the rate the first time a feature is encountered

Problem 1: Complexity of Update Rules for Logistic Regression
- Logistic regression update:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y{=}1 \mid x^{(t)}, w^{(t)}) \right] \right\}$
- Complexity of updates:
  - Constant in the number of data points
  - In the number of features? Linear, in general
- A problem both in terms of computational complexity and sample complexity
- What can we do with very high dimensional feature spaces?
  - Kernels are not always appropriate, or scalable
  - What else?
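
A sketch of the per-coordinate AdaGrad rule above, written as gradient descent on a generic gradient `grad`; the small additive constant is my addition to avoid division by zero on the first update:

```python
import numpy as np

def adagrad_step(w, g_sq_sum, grad, eta):
    """Per-coordinate AdaGrad step: eta_{t,i} = eta / sqrt(sum_{tau<=t} g_{tau,i}^2)."""
    g_sq_sum = g_sq_sum + grad ** 2                    # running sum of squared gradients, per feature
    w = w - eta * grad / (np.sqrt(g_sq_sum) + 1e-8)    # 1e-8 guards against divide-by-zero
    return w, g_sq_sum

# illustrative usage with a 3-dimensional parameter vector
w, g_sq = np.zeros(3), np.zeros(3)
w, g_sq = adagrad_step(w, g_sq, grad=np.array([0.5, 0.0, -1.2]), eta=0.1)
```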

Problem 2: Unknown Number of Features
- For example, bag-of-words features for text data: "Mary had a little lamb, little lamb..."
  - What's the dimensionality of x?
- What if we see a new word that was not in our vocabulary? E.g., "Obamacare"
  - Theoretically, just keep going in your learning, and initialize $w_{\text{Obamacare}} = 0$
  - In practice, need to re-allocate memory, fix indices, ...
- A big problem for Big Data

What Next? Hashing & Sketching!
- Addresses both the dimensionality issue and new features in one approach!
- Let's start with a much simpler problem: is a string in our vocabulary? (a membership query)
- How do we keep track?
  - Explicit list of strings: very slow
  - Fancy trees and tries: hard to implement and maintain
  - Hash tables?

Hash Functions and Hash Tables
- Hash functions map keys to integers (bins); keys can be integers, strings, objects, ...
- Simple example (mod hashing): $h(i) = (a \cdot i + b) \bmod m$
  - Random choice of (a, b) (usually primes)
  - If inputs are uniform, bins are uniformly used
  - From two results one can recover (a, b), so the outputs are not fully independent; typically use fancier hash functions
- Hash table: store a list of objects in each bin
  - Exact, but storage is still linear in the size of the object ids, which can be very long
  - E.g., hashing very long strings, or entire documents

Hash Bit-Vector Table-based Membership Query
- Approximate queries with one-sided error: accept false positives only
  - If we say no, the element is not in the set
  - If we say yes, the element is very likely to be in the set
- Given a hash function, keep a binary bit vector v of length m; insert element i by setting $v[h(i)] = 1$
- Query Q(i): is element i in the set? Answer yes iff $v[h(i)] = 1$
- Collisions: different elements can map to the same bit
- Guarantee: one-sided errors, but we may make many mistakes
- How can we improve the probability of a correct answer?
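
To make the membership query concrete, here is a rough sketch of a single-hash bit vector; it uses Python's built-in `hash` salted with a random value as a stand-in for the $(a \cdot i + b) \bmod m$ family, which is an implementation convenience rather than the slides' construction:

```python
import random

class BitVectorSet:
    """Approximate membership with a single hash function and an m-bit vector.
    One-sided error: 'no' is always correct; 'yes' may be a false positive."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m
        self.salt = random.randrange(1 << 30)   # randomizes the hash per instance

    def _h(self, key):
        return hash((self.salt, key)) % self.m

    def add(self, key):
        self.bits[self._h(key)] = 1

    def query(self, key):
        return self.bits[self._h(key)] == 1

# usage: strings from a vocabulary
s = BitVectorSet(m=1024)
s.add("lamb")
print(s.query("lamb"), s.query("fleece"))   # True, and usually False (unless a collision)
```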

Bloom Filter: Multiple Hash Tables
- A single hash table gives many false positives
- Use multiple independent hash functions over the bit vector:
  - Insert i: apply $h_1(i), \ldots, h_d(i)$ and set all d bits to 1
  - Query Q(i): answer yes only if all d bits $h_1(i), \ldots, h_d(i)$ are 1
- Significantly decreases the probability of false positives

Analysis of Bloom Filter
- Want to keep track of n elements with false positive probability δ > 0: how large must m and d be?
- A simple analysis yields:
  $m = \frac{n \log_2 \frac{1}{\delta}}{\ln 2} \approx 1.5\, n \log_2 \frac{1}{\delta}, \qquad d = \log_2 \frac{1}{\delta}$
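
A sketch of a Bloom filter sized according to the analysis above ($m \approx 1.5\, n \log_2(1/\delta)$, $d = \log_2(1/\delta)$); again, Python's built-in `hash` with random salts stands in for truly independent hash functions:

```python
import math
import random

class BloomFilter:
    """d hash functions over a single m-bit array; a query answers 'yes'
    only if all d corresponding bits are set."""

    def __init__(self, n, delta):
        self.m = math.ceil(n * math.log2(1.0 / delta) / math.log(2))  # ~1.44 n log2(1/delta)
        self.d = math.ceil(math.log2(1.0 / delta))
        self.bits = [0] * self.m
        self.salts = [random.randrange(1 << 30) for _ in range(self.d)]

    def _indices(self, key):
        return [hash((s, key)) % self.m for s in self.salts]

    def add(self, key):
        for idx in self._indices(key):
            self.bits[idx] = 1

    def query(self, key):
        return all(self.bits[idx] for idx in self._indices(key))

bf = BloomFilter(n=10000, delta=0.01)
bf.add("Obamacare")
print(bf.query("Obamacare"), bf.query("lamb"))   # True, and usually False
```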

Sketching Counts
- The Bloom filter is super cool, but not what we need
- We don't just care about whether a feature existed before; we need to keep track of counts of occurrences of features! (assuming $x_i$ is an integer)
- Recall the LR update:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y{=}1 \mid x^{(t)}, w^{(t)}) \right] \right\}$
- Must keep track of (weighted) counts of each feature; e.g., with sparse data, update each non-zero dimension i of $x^{(t)}$
- Can we generalize the Bloom filter?

Count-Min Sketch: single vector
- Simpler problem: count how many times you see each string
- Single hash function: keep a Count vector of length m; every time you see string i:
  $\text{Count}[h(i)] \leftarrow \text{Count}[h(i)] + 1$
- Again, collisions could be a problem: if $a_i$ is the true count of element i, then $\text{Count}[h(i)] = \sum_{k:\, h(k) = h(i)} a_k \ge a_i$

Count-Min Sketch: general case
- Keep a p-by-m Count matrix and p hash functions; just like in the Bloom filter, decrease errors with multiple hashes
- Every time you see string i:
  $\forall j \in \{1, \ldots, p\}: \ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + 1$

Querying the Count-Min Sketch
- Updates: $\forall j \in \{1, \ldots, p\}: \ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + 1$
- Query Q(i): what is the count of element i?
- What is in Count[j, k]? The counts of all elements i' with $h_j(i') = k$, so $\text{Count}[j, h_j(i)] \ge a_i$
- Thus, return the smallest estimate: $\hat{a}_i = \min_j \text{Count}[j, h_j(i)]$
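
A compact sketch of the Count-Min structure described above, with width and depth chosen as in the analysis that follows ($m = \lceil e/\epsilon \rceil$, $p = \lceil \ln(1/\delta) \rceil$); the salted built-in `hash` is the same illustrative shortcut as before:

```python
import math
import random

class CountMinSketch:
    """p-by-m counter matrix with one hash function per row;
    a point query returns the row-wise minimum."""

    def __init__(self, eps, delta):
        self.m = math.ceil(math.e / eps)            # width m = ceil(e / eps)
        self.p = math.ceil(math.log(1.0 / delta))   # depth p = ceil(ln(1 / delta))
        self.count = [[0] * self.m for _ in range(self.p)]
        self.salts = [random.randrange(1 << 30) for _ in range(self.p)]

    def _h(self, j, key):
        return hash((self.salts[j], key)) % self.m

    def update(self, key, delta=1):
        for j in range(self.p):
            self.count[j][self._h(j, key)] += delta

    def query(self, key):
        # with non-negative updates this overestimates the true count by at most eps*n, w.h.p.
        return min(self.count[j][self._h(j, key)] for j in range(self.p))

cms = CountMinSketch(eps=0.01, delta=0.01)
for word in ["lamb", "lamb", "little"]:
    cms.update(word)
print(cms.query("lamb"), cms.query("fleece"))
```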

Analysis of Count-Min Sketch
- Set $m = \left\lceil \frac{e}{\epsilon} \right\rceil$ and $p = \left\lceil \ln \frac{1}{\delta} \right\rceil$, and estimate $\hat{a}_i = \min_j \text{Count}[j, h_j(i)] \ge a_i$
- Then, after seeing n elements: $\hat{a}_i \le a_i + \epsilon n$ with probability at least 1 - δ

Proof of Count-Min for Point Query with Positive Counts: Part 1, Expected Bound
- Let $I_{i,j,k}$ be the indicator that i and k collide on hash j
- Bounding its expected value: $E[I_{i,j,k}] = P(h_j(i) = h_j(k)) \le \frac{1}{m} = \frac{\epsilon}{e}$
- Let $X_{i,j} = \sum_{k \ne i} I_{i,j,k}\, a_k$ be the total colliding mass on the estimate of the count of i in hash j
- Bounding the colliding mass: $E[X_{i,j}] \le \frac{\epsilon}{e} \sum_k a_k \le \frac{\epsilon n}{e}$
- Thus, the estimate from each hash function is close in expectation

Proof of Count-Min for Point Query with Positive Counts: Part 2, High-Probability Bounds
- What we know: $\text{Count}[j, h_j(i)] = a_i + X_{i,j}$ and $E[X_{i,j}] \le \frac{\epsilon n}{e}$
- Markov inequality: for $z_1, \ldots, z_k$ positive i.i.d. random variables, $P(\forall z_i: z_i > e\, E[z_i]) < e^{-k}$
- Applying this to the Count-Min sketch:
  $P(\hat{a}_i > a_i + \epsilon n) \le P(\forall j: X_{i,j} > e\, E[X_{i,j}]) \le e^{-p} \le \delta$

But updates may be positive or negative
- Recall the LR update:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y{=}1 \mid x^{(t)}, w^{(t)}) \right] \right\}$
- Count-Min sketch for the positive and negative case: $a_i$ is no longer necessarily positive
- The update is the same: observe a change $\Delta_i$ to element i:
  $\forall j \in \{1, \ldots, p\}: \ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + \Delta_i$
- Each $\text{Count}[j, h_j(i)]$ is no longer an upper bound on $a_i$
- How do we make a prediction? Return the median of the p counters
- Bound: $|\hat{a}_i - a_i| \le 3 \epsilon \|a\|_1$ with probability at least $1 - \delta^{1/4}$, where $\|a\|_1 = \sum_i |a_i|$
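
Putting the two parts of the proof together; this is a reconstruction consistent with the outline above (assuming $m = e/\epsilon$, $p = \ln(1/\delta)$, and independent hash functions), not a verbatim slide:

```latex
\begin{align*}
\text{Count}[j, h_j(i)] &= a_i + X_{i,j},
  \qquad X_{i,j} = \sum_{k \neq i} I_{i,j,k}\, a_k,\\
\mathbb{E}[X_{i,j}] &\le \frac{1}{m} \sum_{k} a_k
  = \frac{\epsilon}{e} \sum_{k} a_k \le \frac{\epsilon n}{e},\\
P\!\left(\hat a_i > a_i + \epsilon n\right)
  &= P\!\left(\forall j:\ X_{i,j} > \epsilon n\right)
  \le P\!\left(\forall j:\ X_{i,j} > e\,\mathbb{E}[X_{i,j}]\right)
  \le e^{-p} \le \delta.
\end{align*}
```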

Finally, Sketching for LR
- Recall the update:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y{=}1 \mid x^{(t)}, w^{(t)}) \right] \right\}$
- Never need to know the size of the vocabulary!
- At every iteration, update the Count-Min matrix:
- Making a prediction:
- Scales to huge problems, with great practical implications

Hash Kernels
- The Count-Min sketch is not designed for negative updates, and gives biased estimates of dot products
- Hash kernels: a very simple but powerful idea to remove the bias
- Pick 2 hash functions:
  - h: just like in Count-Min hashing
  - ξ: a sign hash function (mapping to ±1)
- ξ removes the bias found in Count-Min hashing (see homework)
- Define a kernel via a projection ϕ of x:
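
One possible reading of the "update the Count-Min matrix / make a prediction" prompts above, reusing the CountMinSketch class sketched earlier: store the weights themselves in the sketch, apply each per-feature gradient step as a signed update, and answer point queries with the median (as in the positive/negative case). Regularization is omitted for brevity, and all names here are mine:

```python
import math
import statistics

class SignedCountMinWeights(CountMinSketch):
    """Weight vector stored in a Count-Min matrix; with signed updates,
    the median of the p counters is the usual point estimate."""
    def query(self, key):
        return statistics.median(self.count[j][self._h(j, key)] for j in range(self.p))

def predict(weights, x_sparse):
    """x_sparse maps feature name -> value; the vocabulary size is never needed."""
    z = sum(v * weights.query(f) for f, v in x_sparse.items())
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights, x_sparse, y, eta):
    p = predict(weights, x_sparse)
    for f, v in x_sparse.items():
        # add this feature's gradient contribution to the sketched weight
        weights.update(f, eta * v * (y - p))
```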

Hash Kernels Preserve Dot Products
- The projection: $\phi_i(x) = \sum_{j:\, h(j) = i} \xi(j)\, x_j$
- Hash kernels provide an unbiased estimate of dot products! Variance decreases as O(1/m)
- Choosing m? For ε > 0, if $m = O\!\left(\frac{\log(N/\delta)}{\epsilon^2}\right)$ (under certain conditions), then with probability at least 1 - δ:
  $(1 - \epsilon)\, \|x - x'\|_2^2 \;\le\; \|\phi(x) - \phi(x')\|_2^2 \;\le\; (1 + \epsilon)\, \|x - x'\|_2^2$

Learning With Hash Kernels
- Given a hash kernel of dimension m, specified by h and ξ, learn an m-dimensional weight vector
- Observe data point x; the dimension does not need to be specified a priori!
- Compute ϕ(x):
  - Initialize ϕ(x) = 0
  - For each non-zero entry j of x: $\phi_{h(j)}(x) \leftarrow \phi_{h(j)}(x) + \xi(j)\, x_j$
- Use the normal update as if the observation were ϕ(x), e.g., for LR using SGD:
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + \phi_i(x^{(t)}) \left[ y^{(t)} - P(Y{=}1 \mid \phi(x^{(t)}), w^{(t)}) \right] \right\}$
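
A sketch of the two-hash projection and the corresponding SGD step, matching the ϕ and update formulas above; Python's built-in `hash` again stands in for h and ξ, and the sparse-dictionary / seed conventions are my own:

```python
import numpy as np

def hash_kernel_features(x_sparse, m, seed=0):
    """phi_i(x) = sum_{j: h(j)=i} xi(j) * x_j, with h a bucket hash and xi a sign hash."""
    phi = np.zeros(m)
    for j, x_j in x_sparse.items():
        h = hash((seed, "bucket", j)) % m                        # bucket in {0, ..., m-1}
        xi = 1.0 if hash((seed, "sign", j)) % 2 == 0 else -1.0   # sign in {-1, +1}
        phi[h] += xi * x_j
    return phi

def hashed_lr_sgd_step(w, x_sparse, y, eta, lam, seed=0):
    """Ordinary regularized LR update applied to phi(x) instead of x."""
    phi = hash_kernel_features(x_sparse, len(w), seed)
    p = 1.0 / (1.0 + np.exp(-np.dot(w, phi)))   # P(Y=1 | phi(x), w)
    return w + eta * (-lam * w + phi * (y - p))

# toy usage: m = 1024 hashed dimensions, one bag-of-words example
w = np.zeros(1024)
w = hashed_lr_sgd_step(w, {"little": 2.0, "lamb": 2.0, "Mary": 1.0}, y=1, eta=0.1, lam=0.01)
```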

Interesting Application of Hash Kernels: Multi-Task Learning
- Personalized click estimation for many users:
  - One global click prediction vector w: but it cannot capture individual preferences
  - A click prediction vector w_u per user u: but most users provide too little data
- Multi-task learning: simultaneously solve multiple related learning problems, using information from one learning problem to inform the others
- In our simple example, learn both a global w and one w_u per user:
  - Prediction for user u: use $(w + w_u) \cdot x$
  - If we know little about user u: the global w dominates
  - After a lot of data from user u: w_u captures that user's preferences

Problems with Simple Multi-Task Learning
- Dealing with a new user is annoying, just like dealing with new words in the vocabulary
- The dimensionality of the joint parameter space is HUGE; e.g., personalized email spam classification from Weinberger et al.:
  - 3.2M emails
  - 40M unique tokens in the vocabulary
  - 430K users
  - 16T parameters needed for personalized classification!

Hash Kernels for Multi-Task Learning
- A simple, pretty solution with hash kernels:
  - View multi-task learning as a (sparse) learning problem with a (huge) joint data point z for point x and user u
  - Estimate the click probability from z as desired
  - Address the huge dimensionality, new words, and new users using hash kernels
- The desired effect is achieved if ϕ includes contributions from both the word alone (for the global w) and the (word, user) pair (for the personalized w_u)

Simple Trick for Forming Projection ϕ(x, u)
- Observe data point x for user u; the dimension does not need to be specified a priori, and the user can be unknown!
- Compute ϕ(x, u):
  - Initialize ϕ(x, u) = 0
  - For each non-zero entry j of x (e.g., j = "Obamacare"), we need two contributions to ϕ:
    - a global contribution, hashing the word j by itself
    - a personalized contribution, hashing the pair (j, u)
  - Simply add $\xi(j)\, x_j$ to $\phi_{h(j)}(x, u)$ and $\xi(j, u)\, x_j$ to $\phi_{h(j, u)}(x, u)$ (see the sketch below)
- Learn as usual, using ϕ(x, u) instead of ϕ(x) in the update function
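
The sketch referenced above: forming ϕ(x, u) by hashing each word once on its own and once paired with the user id, then learning with the same hashed-LR step as before. The hashing shortcut and naming conventions are mine:

```python
import numpy as np

def multitask_features(x_sparse, user, m, seed=0):
    """phi(x, u): every non-zero feature j contributes twice --
    hashed as j (global w) and as (j, user) (personalized w_u)."""
    phi = np.zeros(m)
    for j, x_j in x_sparse.items():
        for key in (j, (j, user)):   # e.g. "Obamacare" and ("Obamacare", user)
            h = hash((seed, "bucket", key)) % m
            xi = 1.0 if hash((seed, "sign", key)) % 2 == 0 else -1.0
            phi[h] += xi * x_j
    return phi

# Learn as usual: feed phi(x, u) in place of phi(x) to the SGD update sketched earlier.
phi = multitask_features({"Obamacare": 1.0, "free": 2.0}, user="user123", m=1024)
```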

Results from Weinberger et al. on Spam Classification: Effect of m
[figure not reproduced in this transcription]

Results from Weinberger et al. on Spam Classification: Illustrating the Multi-Task Effect
[figure not reproduced in this transcription]

What you need to know
- Hash functions
- Bloom filter
  - Test membership with some false positives, but a very small number of bits per element
- Count-Min sketch
  - Positive counts: an upper bound with nice rates of convergence
  - General (positive and negative) case
  - Application to logistic regression
- Hash kernels:
  - Sparse representation for feature vectors
  - Very simple: use two hash functions (or use one hash function and take its least significant bit to define ξ)
  - Quickly generate the projection ϕ(x)
  - Learn in the projected space
- Multi-task learning:
  - Solve many related learning problems simultaneously
  - Very easy to implement with hash kernels
  - Significantly improves accuracy in some problems (if there is enough data from individual users)