Online Advertising is Big Business

Online Advertising

Online Advertising is Big Business. It is a multi-billion-dollar industry: $43B in the USA in 2013, a 17% increase over 2012 [PwC, Internet Advertising Bureau, April 2013]. It generates higher revenue in the USA than cable TV and nearly as much as broadcast TV [PwC, Internet Advertising Bureau, Oct 2013], and it is a large source of revenue for Google and other search engines.

Canonical Scalable ML Problem. The problem is hard; we need all the data we can get: success varies by type of online ad (banner, sponsored search, email, etc.) and by ad campaign, and click-through rates can be less than 1% [Andrew Stern, iMedia Connection, 2010]. Fortunately, there is lots of data: lots of people use the internet, and it is easy to gather labeled data. This makes click-through prediction a great success story for scalable ML.

The Players. Publishers (NYTimes, Google, ESPN) make money by displaying ads on their sites. Advertisers (Marc Jacobs, Fossil, Macy's, Dr Pepper) pay for their ads to be displayed on publisher sites because they want to attract business. Matchmakers (Google, Microsoft, Yahoo) match publishers with advertisers in real time, i.e., as a specific user visits a website.

Why Do Advertisers Pay? Impressions: to get a message to a target audience, e.g., a brand awareness campaign. Performance: to get users to do something, e.g., click on an ad (pay-per-click, the most common model), buy something, or join a mailing list.

Efficient Matchmaking. Idea: predict the probability that the user will click each ad, and choose ads to maximize this probability. That is, estimate P(click | predictive features), a conditional probability: the probability of a click given the predictive features. Predictive features include the ad's historical performance, advertiser and ad content info, publisher info, and user info (e.g., search / click history).

Publishers Get Billions of Impressions Per Day. But the data is high-dimensional, sparse, and skewed: there are hundreds of millions of online users, millions of unique publisher pages on which to display ads, and millions of unique ads to display, yet very few ads get clicked by users. Massive datasets are crucial to tease out the signal. Goal: estimate P(click | user, ad, publisher info) given massive amounts of labeled data.
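
The slides do not show code for this, but as a hedged illustration of what "high-dimensional and sparse" means here, the sketch below one-hot encodes a hypothetical user-ad-publisher impression with feature hashing; the field names, hash function, and dimensionality are all assumptions, not the course's pipeline.

```python
import hashlib

NUM_FEATURES = 2 ** 20  # assumed size of the hashed feature space

def hashed_indices(impression):
    """Map each field=value pair of an impression (a dict) to one index.
    The feature vector is implicitly 1.0 at these indices, 0.0 elsewhere."""
    return sorted(
        int(hashlib.md5(f"{field}={value}".encode()).hexdigest(), 16) % NUM_FEATURES
        for field, value in impression.items()
    )

# A hypothetical impression; real systems would include many more fields.
impression = {"user_id": "u42", "ad_id": "ad_1337", "publisher": "nytimes.com"}
print(hashed_indices(impression))  # a few active indices out of ~10^6
```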

Linear Classification and Logistic Regression

Classification. Goal: learn a mapping from observations to discrete labels given a set of training examples (supervised learning). Example: spam classification. Observations are emails, and labels are {spam, not-spam} (binary classification). Given a set of labeled emails, we want to predict whether a new email is spam or not-spam.

Classification. Goal: learn a mapping from observations to discrete labels given a set of training examples (supervised learning). Example: click-through rate prediction. Observations are user-ad-publisher triples, and labels are {not-click, click} (binary classification). Given a set of labeled observations, we want to predict whether a new user-ad-publisher triple will result in a click.

Reminder: Linear Regression. Example: predicting shoe size from height, gender, and weight. For each observation we have a feature vector $\mathbf{x} = [x_1 \; x_2 \; x_3]$ and a label $y$. We assume a linear mapping between features and label: $y \approx w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$.

Reminder: Linear Regression. Example: predicting shoe size from height, gender, and weight. We can augment the feature vector to incorporate the offset: $\mathbf{x} = [1 \; x_1 \; x_2 \; x_3]$. We can then rewrite this linear mapping as a scalar (dot) product: $y \approx \hat{y} = \sum_{i=0}^{3} w_i x_i = \mathbf{w} \cdot \mathbf{x}$.
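
To make the dot-product form concrete, here is a minimal numpy sketch; the weights and feature values below are made up for illustration.

```python
import numpy as np

# Made-up weights [w0, w1, w2, w3]; w0 is the offset term.
w = np.array([1.5, 0.8, -0.2, 0.01])
# Made-up raw features, e.g., height, gender, weight.
x_raw = np.array([70.0, 1.0, 150.0])
x = np.concatenate(([1.0], x_raw))  # prepend 1 so w0 acts as the offset
y_hat = np.dot(w, x)                # y_hat = sum_i w_i * x_i = w . x
print(y_hat)
```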

Why a Linear Mapping? It is simple, it often works well in practice, and we can introduce complexity via feature extraction. Can we do something similar for classification?

Linear Regression to Linear Classifier. Example: predicting rain from temperature, cloudiness, and humidity. We use the same feature representation, $\mathbf{x} = [1 \; x_1 \; x_2 \; x_3]$. How can we make class predictions in {not-rain, rain}, {not-spam, spam}, {not-click, click}? We can threshold by sign: $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}) = \mathrm{sign}\big(\sum_{i=0}^{3} w_i x_i\big)$.
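
A minimal sketch of the sign-threshold classifier described above; the weight vector and feature values are invented for illustration, and ties at $\mathbf{w} \cdot \mathbf{x} = 0$ are broken arbitrarily.

```python
import numpy as np

def predict(w, x):
    """Threshold the linear score by sign: +1 for one class, -1 for the other."""
    return 1 if np.dot(w, x) > 0 else -1

# Made-up weights [offset, temperature, cloudiness, humidity] and features.
w = np.array([-1.0, 0.05, 2.0, 0.5])
x = np.array([1.0, 18.0, 0.9, 0.7])  # augmented feature vector
print(predict(w, x))  # +1 ~ rain, -1 ~ not-rain
```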

Linear Classifier Decision Boundary. [Figure: a 2-D plot showing example points $\mathbf{x}$ with their values of $\mathbf{w} \cdot \mathbf{x}$ on either side of the decision boundary.] Let's interpret this rule: $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, so $\hat{y} = 1$ when $\mathbf{w} \cdot \mathbf{x} > 0$ and $\hat{y} = -1$ when $\mathbf{w} \cdot \mathbf{x} < 0$. The decision boundary is the set of points where $\mathbf{w} \cdot \mathbf{x} = 0$.

Evaluating Predictions. Regression: we can measure closeness between label and prediction using the squared loss $(y - \hat{y})^2$; in song year prediction, for example, it is better to be off by one year than by 20 years. Classification: class predictions are discrete, so we use the 0-1 loss: the penalty is 0 for a correct prediction and 1 otherwise.
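
A small sketch contrasting the two losses; the labels and predictions are made up.

```python
def squared_loss(y, y_hat):
    """Regression loss: being further off costs more."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Classification loss: 0 if correct, 1 otherwise."""
    return 0 if y == y_hat else 1

print(squared_loss(1999, 2000), squared_loss(1999, 2019))  # 1 vs 400
print(zero_one_loss(1, -1), zero_one_loss(1, 1))           # 1 vs 0
```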

0/1 Loss Minimization. Let $y \in \{-1, 1\}$ and define $z = y \, \mathbf{w} \cdot \mathbf{x}$; $z$ is positive if $y$ and $\mathbf{w} \cdot \mathbf{x}$ have the same sign, and negative otherwise. The 0/1 loss is $\ell_{0/1}(z) = 1$ if $z < 0$ and $0$ otherwise. [Figure: plot of $\ell_{0/1}(z)$ versus $z = y \, \mathbf{w} \cdot \mathbf{x}$.]

How Can We Learn the Model ($\mathbf{w}$)? Assume we have $n$ training points, where $\mathbf{x}^{(i)}$ denotes the $i$th point. Recall two earlier points: the linear assumption $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$ and the 0-1 loss $\ell_{0/1}(z)$. Idea: find the $\mathbf{w}$ that minimizes the total (equivalently, average) 0-1 loss over the training points: $\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{0/1}\big(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\big)$.

0/1 Loss Minimization. Recall $\ell_{0/1}(z) = 1$ if $z < 0$ and $0$ otherwise, where $y \in \{-1, 1\}$ and $z = y \, \mathbf{w} \cdot \mathbf{x}$; $z$ is positive if $y$ and $\mathbf{w} \cdot \mathbf{x}$ have the same sign, and negative otherwise. Issue: this is a hard optimization problem, as the 0/1 loss is not convex! [Figure: plot of $\ell_{0/1}(z)$ versus $z = y \, \mathbf{w} \cdot \mathbf{x}$.]

Approximate 0/1 Loss. Solution: approximate the 0/1 loss with a convex loss (a "surrogate loss"). SVM uses the hinge loss, logistic regression the logistic loss, and Adaboost the exponential loss. [Figure: plot of the 0-1 loss and the SVM, logistic regression, and Adaboost surrogate losses versus $z = y \, \mathbf{w} \cdot \mathbf{x}$.]

Approximate 0/1 Loss. Solution: approximate the 0/1 loss with a convex loss (a "surrogate loss"). Logistic regression uses the logistic loss (log loss): $\ell_{\log}(z) = \log(1 + e^{-z})$. [Figure: plot of the 0-1 loss and the surrogate losses versus $z = y \, \mathbf{w} \cdot \mathbf{x}$.]
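
A small sketch comparing the 0/1 loss and the logistic surrogate at a few made-up margin values $z$.

```python
import numpy as np

def zero_one(z):
    """0/1 loss as a function of the margin z = y * (w . x)."""
    return np.where(z < 0, 1.0, 0.0)

def logistic_loss(z):
    """Logistic surrogate: log(1 + e^{-z}), computed via log1p."""
    return np.log1p(np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(zero_one(z))       # [1. 1. 0. 0. 0.]
print(logistic_loss(z))  # smooth, convex, and decreasing in z
```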

Logistic Regression Optimization. Logistic regression: learn the mapping ($\mathbf{w}$) that minimizes the logistic loss on the training data together with a regularization term: $\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{\log}\big(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\big) + \lambda \lVert \mathbf{w} \rVert_2^2$, where the first term is the training log loss and the second term penalizes model complexity. The objective is convex, but a closed-form solution doesn't exist.

Logistic Regression Optimization. Goal: find the $\mathbf{w}$ that minimizes $f(\mathbf{w}) = \sum_{i=1}^{n} \ell_{\log}\big(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\big)$. We can solve this via gradient descent. [Figure: $f(\mathbf{w})$ plotted against $\mathbf{w}$, with steps of size $\alpha_i$ descending toward the minimizer $\mathbf{w}^*$.] Update rule: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \nabla f(\mathbf{w}_i)$, where the gradient is $\nabla f(\mathbf{w}_i) = \sum_{j=1}^{n} \Big(1 - \frac{1}{1 + \exp(-y^{(j)} \, \mathbf{w}_i \cdot \mathbf{x}^{(j)})}\Big)\big(-y^{(j)} \mathbf{x}^{(j)}\big)$.
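
A minimal sketch of this gradient descent update on synthetic data; the data, step size, and iteration count are made up, and this is not the course's reference implementation.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of f(w) = sum_j log(1 + exp(-y_j * w . x_j)).
    Per point: (1 - 1 / (1 + exp(-y_j * w . x_j))) * (-y_j * x_j)."""
    margins = y * (X @ w)
    coeffs = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return (coeffs * -y) @ X  # sum over the n training points

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # synthetic features
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, 1, -1)  # synthetic labels

w = np.zeros(3)
alpha = 0.01  # made-up constant step size
for _ in range(1000):
    w = w - alpha * gradient(w, X, y)
print(w)  # should roughly align with the direction [1, -2, 0.5]
```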

Logistic Regression Optimization. Regularized logistic regression: learn the mapping ($\mathbf{w}$) that minimizes the logistic loss on the training data with a regularization term, added as in ridge regression: $\min_{\mathbf{w}} \sum_{i=1}^{n} \ell_{\log}\big(y^{(i)} \, \mathbf{w} \cdot \mathbf{x}^{(i)}\big) + \lambda \lVert \mathbf{w} \rVert_2^2$, where the first term is the training log loss and the second term penalizes model complexity. The objective is convex, but a closed-form solution doesn't exist.
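
As a sketch of how the update changes with regularization: the gradient of $\lambda \lVert \mathbf{w} \rVert_2^2$ is $2\lambda \mathbf{w}$, so it simply adds to the log-loss gradient from the previous sketch (again an illustrative implementation, not the course's).

```python
import numpy as np

def regularized_gradient(w, X, y, lam):
    """Gradient of sum_j log(1 + exp(-y_j * w . x_j)) + lam * ||w||^2."""
    margins = y * (X @ w)
    coeffs = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return (coeffs * -y) @ X + 2.0 * lam * w  # log-loss gradient + 2*lam*w
```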