Online Advertising is Big Business


Online Advertising

Online Advertising is Big Business
- Multi-billion dollar industry: $43B in the USA in 2013, a 17% increase over 2012 [PwC, Internet Advertising Bureau, April 2013]
- Higher revenue in the USA than cable TV, and nearly the same as broadcast TV [PwC, Internet Advertising Bureau, Oct 2013]
- A large source of revenue for Google and other search engines

Canonical Scalable ML Problem
- The problem is hard; we need all the data we can get!
- Success varies by type of online ad (banner, sponsored search, email, etc.) and by ad campaign, but can be less than 1% [Andrew Stern, iMedia Connection, 2010]
- Lots of data: lots of people use the internet, and it is easy to gather labeled data
- A great success story for scalable ML

The Players
- Publishers (NYTimes, Google, ESPN): make money by displaying ads on their sites
- Advertisers (Marc Jacobs, Fossil, Macy's, Dr. Pepper): pay for their ads to be displayed on publisher sites; they want to attract business
- Matchmakers (Google, Microsoft, Yahoo): match publishers with advertisers in real time (i.e., as a specific user visits a website)

Why Do Advertisers Pay?
- Impressions: get the message to a target audience, e.g., a brand awareness campaign
- Performance: get users to do something, e.g., click on an ad (pay-per-click, the most common model), buy something, or join a mailing list

Efficient Matchmaking
- Idea: predict the probability that the user will click on each ad, and choose ads to maximize this probability
- Estimate P(click | predictive features), a conditional probability: the probability of a click given the predictive features
- Predictive features: the ad's historical performance, advertiser and ad content info, publisher info, user info (e.g., search / click history)

Publishers Get Billions of Impressions Per Day
But the data is high-dimensional, sparse, and skewed:
- Hundreds of millions of online users
- Millions of unique publisher pages on which to display ads
- Millions of unique ads to display
- Very few ads actually get clicked by users
Massive datasets are crucial to tease out the signal.
Goal: estimate P(click | user, ad, publisher info), given massive amounts of labeled data.

Linear Classification and Logistic Regression

Classification
Goal: learn a mapping from observations to discrete labels given a set of training examples (supervised learning).
Example: spam classification
- Observations are emails
- Labels are {spam, not-spam} (binary classification)
- Given a set of labeled emails, we want to predict whether a new email is spam or not-spam

Classification
Goal: learn a mapping from observations to discrete labels given a set of training examples (supervised learning).
Example: click-through rate prediction
- Observations are user-ad-publisher triples
- Labels are {not-click, click} (binary classification)
- Given a set of labeled observations, we want to predict whether a new user-ad-publisher triple will result in a click

Reminder: Linear Regression
Example: predicting shoe size from height, gender, and weight.
For each observation we have a feature vector $\mathbf{x} = [x_1 \; x_2 \; x_3]^\top$ and a label $y$.
We assume a linear mapping between features and label:
$y \approx w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

Reminder: Linear Regression
Example: predicting shoe size from height, gender, and weight.
We can augment the feature vector to incorporate the offset: $\mathbf{x} = [1 \; x_1 \; x_2 \; x_3]^\top$.
We can then rewrite this linear mapping as a scalar (dot) product:
$y \approx \hat{y} = \sum_{i=0}^{3} w_i x_i = \mathbf{w} \cdot \mathbf{x}$

Why a Linear Mapping?
- Simple
- Often works well in practice
- Can introduce complexity via feature extraction
Can we do something similar for classification?

Linear Regression → Linear Classifier
Example: predicting rain from temperature, cloudiness, and humidity.
Use the same feature representation: $\mathbf{x} = [1 \; x_1 \; x_2 \; x_3]^\top$.
How can we make class predictions in {not-rain, rain}, {not-spam, spam}, {not-click, click}?
We can threshold the linear score by its sign:
$\hat{y} = \mathrm{sign}\!\left(\sum_{i=0}^{3} w_i x_i\right) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
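For concreteness, here is a minimal NumPy sketch of this thresholding rule; the weights and feature values below are hypothetical, not taken from the lecture.

import numpy as np

# Hypothetical weights for [offset, temperature, cloudiness, humidity] and one augmented observation.
w = np.array([-4.0, 0.02, 1.5, 2.0])
x = np.array([1.0, 70.0, 1.0, 0.95])   # x = [1, x1, x2, x3]

score = np.dot(w, x)                   # the linear score w . x
y_hat = 1 if score > 0 else -1         # threshold by sign: y_hat = sign(w . x)
print(score, y_hat)                    # score is positive here, so we predict "rain"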

Linear Classifier Decision Boundary
[Figure: example points in the (x1, x2) plane labeled with their values of w · x, illustrating the decision boundary]
Let's interpret this rule: $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
- $\hat{y} = 1$ when $\mathbf{w} \cdot \mathbf{x} > 0$
- $\hat{y} = -1$ when $\mathbf{w} \cdot \mathbf{x} < 0$
Decision boundary: $\mathbf{w} \cdot \mathbf{x} = 0$

Evaluating Predictions
Regression: we can measure how close the prediction is to the label. In song year prediction, it is better to be off by a year than by 20 years. Squared loss: $(y - \hat{y})^2$
Classification: class predictions are discrete. 0-1 loss: the penalty is 0 for a correct prediction, and 1 otherwise.
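A tiny sketch contrasting the two losses; the song years and labels below are made up for illustration.

# Regression: squared loss measures how far off the prediction is.
y_true, y_pred = 2005.0, 2003.0
print((y_true - y_pred) ** 2)            # 4.0; being off by 20 years would cost 400.0

# Classification: 0-1 loss only records whether the prediction is right or wrong.
label, prediction = 1, -1                # labels in {-1, 1}
print(0 if label == prediction else 1)   # 1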

0/1 Loss Minimization
Let $y \in \{-1, 1\}$ and define $z = y \, \mathbf{w} \cdot \mathbf{x}$; $z$ is positive if $y$ and $\mathbf{w} \cdot \mathbf{x}$ have the same sign, and negative otherwise.
$\ell_{0/1}(z) = 1$ if $z < 0$, and $0$ otherwise
[Figure: the 0-1 loss plotted as a function of z = y w · x]

How Can We Learn the Model (w)?
Assume we have n training points, where $\mathbf{x}^{(i)}$ denotes the ith point.
Recall two earlier points:
- Linear assumption: $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
- We use the 0-1 loss: $\ell_{0/1}(z)$
Idea: find w that minimizes the average 0-1 loss over the training points:
$\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{0/1}\!\left(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\right)$

0/1 Loss Minimization
$\ell_{0/1}(z) = 1$ if $z < 0$, and $0$ otherwise, where $z = y \, \mathbf{w} \cdot \mathbf{x}$ and $y \in \{-1, 1\}$.
Issue: this is a hard optimization problem, and it is not convex!
[Figure: the 0-1 loss plotted as a function of z = y w · x]

Approximate 0/1 Loss
Solution: approximate the 0/1 loss with a convex loss (a "surrogate loss").
[Figure: the 0-1 loss together with the convex surrogate losses used by SVM, logistic regression, and Adaboost, plotted as functions of z = y w · x]
SVM (hinge loss), logistic regression (logistic loss), Adaboost (exponential loss)

Approximate 0/1 Loss
Solution: approximate the 0/1 loss with a convex loss (a "surrogate loss").
[Figure: the 0-1 loss together with the convex surrogate losses used by SVM, logistic regression, and Adaboost, plotted as functions of z = y w · x]
Logistic loss (log loss): $\ell_{\log}(z) = \log(1 + e^{-z})$
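As a concrete illustration (not from the lecture), a short NumPy sketch comparing the 0-1 loss with the logistic surrogate on a few made-up margin values z:

import numpy as np

def zero_one_loss(z):
    return np.where(z < 0, 1.0, 0.0)

def logistic_loss(z):
    # log(1 + e^{-z}); np.logaddexp(0, -z) computes this in a numerically stable way
    return np.logaddexp(0.0, -z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example margins z = y * (w . x)
print(zero_one_loss(z))                      # jumps from 1 to 0 at z = 0
print(logistic_loss(z))                      # smooth, convex, and decays toward 0 for large positive z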

Logistic Regression Optimization
Logistic regression: learn a mapping (w) that minimizes the logistic loss on the training data, plus a regularization term:
$\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{\log}\!\left(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\right) + \lambda \lVert \mathbf{w} \rVert_2^2$
(the first term is the training log loss, the second penalizes model complexity)
The objective is convex, but a closed-form solution doesn't exist.

Logistic Regression Optimization
Goal: find w that minimizes $f(\mathbf{w}) = \sum_{i=1}^{n} \ell_{\log}\!\left(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\right)$
We can solve this via gradient descent.
[Figure: f(w) plotted against w, with gradient-descent steps of a given step size moving toward the minimizer w*]
Update rule: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \, \nabla f(\mathbf{w}_i)$, where the gradient is
$\nabla f(\mathbf{w}_i) = \sum_{j=1}^{n} \left(1 - \frac{1}{1 + \exp\!\left(-y^{(j)} \, \mathbf{w}_i \cdot \mathbf{x}^{(j)}\right)}\right)\left(-y^{(j)} \mathbf{x}^{(j)}\right)$
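A minimal NumPy sketch of this gradient-descent update on the unregularized objective; the toy data, fixed step size, and iteration count below are assumptions for illustration.

import numpy as np

def train_logistic_gd(X, y, alpha=0.1, num_iters=100):
    # X: one augmented feature row per example; y: labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        margins = y * (X @ w)                        # z^(j) = y^(j) * (w . x^(j))
        coeffs = 1.0 - 1.0 / (1.0 + np.exp(-margins))
        grad = -(coeffs * y) @ X                     # sum_j coeff_j * (-y^(j) x^(j))
        w = w - alpha * grad                         # w_{i+1} = w_i - alpha * gradient
    return w

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, -0.3]])   # toy augmented features
y = np.array([1.0, -1.0, 1.0, -1.0])
print(train_logistic_gd(X, y))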

Logistic Regression Optimization
Regularized logistic regression: learn a mapping (w) that minimizes the logistic loss on the training data, plus a regularization term:
$\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{\log}\!\left(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\right) + \lambda \lVert \mathbf{w} \rVert_2^2$
(the first term is the training log loss, the second penalizes model complexity)
The objective is convex, but a closed-form solution doesn't exist. We add the regularization term just as in ridge regression.

Logistic Regression: Probabilistic Interpretation

Approximate 0/1 Loss
Solution: approximate the 0/1 loss with a convex loss (a "surrogate loss").
[Figure: the 0-1 loss together with the convex surrogate losses used by SVM, logistic regression, and Adaboost, plotted as functions of z = y w · x]
SVM (hinge loss), logistic regression (logistic loss), Adaboost (exponential loss)

Probabilistic Interpretation
Goal: model the conditional probability P[y = 1 | x]
Example: predict rain from temperature, cloudiness, and humidity
- P[y = rain | t = 14°F, c = LOW, h = 2%] = .05
- P[y = rain | t = 70°F, c = HIGH, h = 95%] = .9
Example: predict click from the ad's historical performance, the user's click frequency, and the publisher page's relevance
- P[y = click | h = GOOD, f = HIGH, r = HIGH] = .1
- P[y = click | h = BAD, f = LOW, r = LOW] = .001

Probabilistic Interpretation
Goal: model the conditional probability P[y = 1 | x]
First thought: P[y = 1 | x] = w · x
But linear regression returns any real number, while probabilities range from 0 to 1! How can we transform or "squash" its output?
Use the logistic (or sigmoid) function: P[y = 1 | x] = σ(w · x)

Logistic Function
$\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
- Maps real numbers to [0, 1]
- Large positive inputs → 1
- Large negative inputs → 0
[Figure: the sigmoid curve σ(z) plotted as a function of z]
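A one-line NumPy sketch of the logistic function, evaluated on a few arbitrary inputs to show the squashing behavior:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values near 0 for large negative z, exactly 0.5 at z = 0, and near 1 for large positive z.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))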

Probabilistic Interpretation
Goal: model the conditional probability P[y = 1 | x]
Logistic regression uses the logistic function to model this conditional probability:
- P[y = 1 | x] = σ(w · x)
- P[y = 0 | x] = 1 − σ(w · x)
For notational convenience we now define y ∈ {0, 1}

How Do We Use Probabilities?
To make class predictions, we need to convert probabilities to values in {0, 1}.
We can do this by setting a threshold on the probabilities; the default threshold is 0.5:
P[y = 1 | x] > 0.5 ⟹ ŷ = 1
Example: predict rain from temperature, cloudiness, and humidity
- P[y = rain | t = 14°F, c = LOW, h = 2%] = .05 ⟹ ŷ = 0
- P[y = rain | t = 70°F, c = HIGH, h = 95%] = .9 ⟹ ŷ = 1

Connection with the Decision Boundary?
[Figure: the decision boundary w · x = 0 in the (x1, x2) plane]
Threshold by sign to make class predictions: ŷ = sign(w · x)
- ŷ = 1 when w · x > 0
- ŷ = 0 when w · x < 0
- Decision boundary: w · x = 0
How does this compare with thresholding the probability? P[y = 1 | x] = σ(w · x) > 0.5 ⟹ ŷ = 1

Connection with the Decision Boundary?
[Figure: the sigmoid σ(z) with σ(0) = 0.5 marked; σ(w · x) = 0.5 exactly when w · x = 0]
Threshold by sign to make class predictions: ŷ = sign(w · x)
- ŷ = 1 when w · x > 0
- ŷ = 0 when w · x < 0
- Decision boundary: w · x = 0
How does this compare with thresholding the probability? P[y = 1 | x] = σ(w · x) > 0.5 ⟹ ŷ = 1
With a threshold of 0.5, the decision boundaries are identical!

Using Probabilistic Predictions

How Do We Use Probabilities?
To make class predictions, we need to convert probabilities to values in {0, 1}.
We can do this by setting a threshold on the probabilities; the default threshold is 0.5:
P[y = 1 | x] > 0.5 ⟹ ŷ = 1
Example: predict rain from temperature, cloudiness, and humidity
- P[y = rain | t = 14°F, c = LOW, h = 2%] = .05 ⟹ ŷ = 0
- P[y = rain | t = 70°F, c = HIGH, h = 95%] = .9 ⟹ ŷ = 1

Setting Different Thresholds
In a spam detection application, we model P[y = spam | x].
There are two types of error:
- Classifying a not-spam email as spam (false positive, FP)
- Classifying a spam email as not-spam (false negative, FN)
One can argue that false positives are more harmful than false negatives: it is worse to miss an important email than to have to delete spam. We can therefore use a threshold greater than 0.5 to be more conservative.

ROC Plots: Measuring Varying Thresholds
[Figure: ROC curve with the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis; the top-left corner is perfect classification, the dotted diagonal is random prediction, and the curve runs from threshold = 0 to threshold = 1, e.g. marking the TPR achieved at FPR = 0.1]
- An ROC plot displays FPR vs. TPR; the top left is perfect, and the dotted line is random prediction (i.e., biased coin flips)
- We can classify at various thresholds (T):
  - T = 0: everything is spam, so TPR = 1, but FPR = 1
  - T = 1: nothing is spam, so FPR = 0, but TPR = 0
- We can trade off between TPR and FPR
- FPR: % of not-spam predicted as spam; TPR: % of spam predicted as spam
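To make the threshold sweep concrete, here is a small NumPy sketch (with made-up labels and predicted probabilities) that computes an (FPR, TPR) point for several thresholds; at T = 0 everything is flagged as spam and at T = 1 nothing is.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # true labels, 1 = spam
p = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # predicted P[y = 1 | x]

for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    y_hat = (p > threshold).astype(int)
    tpr = np.sum((y_hat == 1) & (y == 1)) / np.sum(y == 1)   # % of spam predicted as spam
    fpr = np.sum((y_hat == 1) & (y == 0)) / np.sum(y == 0)   # % of not-spam predicted as spam
    print(threshold, fpr, tpr)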

Working Directly with Probabilities
Example: predict click from the ad's historical performance, the user's click frequency, and the publisher page's relevance
- P[y = click | h = GOOD, f = HIGH, r = HIGH] = .1 ⟹ ŷ = 0
- P[y = click | h = BAD, f = LOW, r = LOW] = .001 ⟹ ŷ = 0
Success rates can be less than 1% [Andrew Stern, iMedia Connection, 2010], so with a 0.5 threshold both predictions are 0, yet the probabilities still provide more granular information:
- The confidence of the prediction
- Useful when combining predictions with other information
In such cases, we want to evaluate the probabilities directly, and the logistic loss makes sense for evaluation!

Logistic Loss
$\ell_{\log}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$
[Figure: -log(p) and -log(1 - p) plotted as functions of p]
When y = 1, we want p = 1: there is no penalty at p = 1, and the penalty increases as p moves away from 1. Similar logic applies when y = 0.
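A small NumPy sketch of this loss evaluated directly on predicted probabilities; the probabilities and labels below are illustrative.

import numpy as np

def log_loss(p, y):
    # -log(p) when y = 1, -log(1 - p) when y = 0
    return np.where(y == 1, -np.log(p), -np.log(1.0 - p))

p = np.array([0.9, 0.6, 0.1, 0.001])   # predicted P[y = 1 | x]
y = np.array([1, 1, 0, 0])             # true labels in {0, 1}
print(log_loss(p, y))                  # small penalties for confident, correct probabilities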