Ad Placement Strategies

Case Study 1: Estimating Click Probabilities
Intro, Logistic Regression, Gradient Descent + SGD, AdaGrad
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, January 7th, 2014

Ad Placement Strategies
Companies bid on ad prices. Which ad wins? (many simplifications here)
Naively:
But:
Instead:

Key Task: Estimating Click Probabilities
What is the probability that user i will click on ad j?
Not important just for ads:
- Optimize search results
- Suggest news articles
- Recommend products
Methods are much more general, useful for:
- Classification
- Regression
- Density estimation

Learning Problem for Click Prediction
Prediction task:
Features:
Data:
- Batch:
- Online:
Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches.

Logistic Regression
Logistic function (or sigmoid): \sigma(z) = \frac{1}{1 + e^{-z}}
Learn P(Y|X) directly:
- Assume a particular functional form
- Sigmoid applied to a linear function of the data:
  P(Y=1 \mid x, w) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i=1}^d w_i x_i)\big)}
Features can be discrete or continuous!
Very convenient!
P(Y=1 \mid x, w) > P(Y=0 \mid x, w) exactly when w_0 + \sum_{i=1}^d w_i x_i > 0, which implies a linear classification rule!
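To make the functional form concrete, here is a minimal sketch (in Python, with made-up weights; the function names are illustrative, not from the slides) of the sigmoid model and the linear decision rule it induces.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, w0):
    """P(Y=1 | x, w) under the logistic regression model."""
    return sigmoid(w0 + np.dot(w, x))

def predict_label(x, w, w0):
    """Linear classification rule: predict 1 exactly when w0 + w.x > 0."""
    return int(w0 + np.dot(w, x) > 0)

# toy usage with hypothetical weights and features
w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 1.2])
print(predict_proba(x, w, w0), predict_label(x, w, w0))
```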

Digression: Logistic regression more generally
Logistic regression in the more general case, where Y \in \{y_1, \dots, y_R\}:
- for k < R:
  P(Y=y_k \mid x, w) = \frac{\exp\!\big(w_{k0} + \sum_{i=1}^d w_{ki} x_i\big)}{1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^d w_{ji} x_i\big)}
- for k = R (normalization, so no weights for this class):
  P(Y=y_R \mid x, w) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^d w_{ji} x_i\big)}
Features can be discrete or continuous!

Loss function: Conditional Likelihood
Have a bunch of iid data of the form \{(x^j, y^j)\}_{j=1}^N.
Discriminative (logistic regression) loss function: conditional data likelihood
\ln P(D_Y \mid D_X, w) = \sum_{j=1}^N \ln P(y^j \mid x^j, w)
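A minimal sketch of the multiclass form above, assuming weights exist only for classes 1..R-1 and class R acts as the normalization (reference) class; the array shapes and numbers below are illustrative, not from the lecture.

```python
import numpy as np

def multiclass_probs(x, W, b):
    """P(Y=y_k | x) for k = 1..R with class R as the reference class.

    W has shape (R-1, d) and b shape (R-1,): weights only for the first
    R-1 classes; class R gets score 0, which provides the normalization.
    """
    scores = W @ x + b                             # shape (R-1,)
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()
    return np.append(expo / denom, 1.0 / denom)    # last entry is class R

# toy usage: d = 2 features, R = 3 classes (hypothetical numbers)
W = np.array([[1.0, -0.5], [0.2, 0.8]])
b = np.array([0.0, -1.0])
print(multiclass_probs(np.array([1.0, 2.0]), W, b))  # probabilities sum to 1
```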

Expressing Conditional Log Likelihood
\ell(w) = \sum_j \Big[ y^j \ln P(Y=1 \mid x^j, w) + (1 - y^j) \ln P(Y=0 \mid x^j, w) \Big]
        = \sum_j \Big[ y^j \big(w_0 + \sum_{i=1}^d w_i x_i^j\big) - \ln\big(1 + \exp(w_0 + \sum_{i=1}^d w_i x_i^j)\big) \Big]

Maximizing Conditional Log Likelihood
\ell(w) = \sum_j \Big[ y^j \big(w_0 + \sum_{i=1}^d w_i x_i^j\big) - \ln\big(1 + \exp(w_0 + \sum_{i=1}^d w_i x_i^j)\big) \Big]
Good news: \ell(w) is a concave function of w, so there are no local-optima problems.
Bad news: there is no closed-form solution that maximizes \ell(w).
Good news: concave functions are easy to optimize.
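As a concreteness check, here is a small sketch (assuming NumPy, 0/1 labels, and synthetic data; the names are mine) that evaluates the conditional log likelihood in the algebraically simplified form above.

```python
import numpy as np

def conditional_log_likelihood(X, y, w, w0):
    """l(w) = sum_j [ y_j * (w0 + w.x_j) - ln(1 + exp(w0 + w.x_j)) ].

    X: (N, d) feature matrix, y: (N,) labels in {0, 1}.
    """
    z = w0 + X @ w
    # ln(1 + exp(z)) computed stably via logaddexp(0, z)
    return np.sum(y * z - np.logaddexp(0.0, z))

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, 0, 1, 1, 0])
print(conditional_log_likelihood(X, y, w=np.zeros(3), w0=0.0))  # equals 5 * ln(1/2)
```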

Optimizing a concave function: Gradient ascent
The conditional likelihood for logistic regression is concave. Find the optimum with gradient ascent.
Gradient: \nabla_w \ell(w) = \Big[\tfrac{\partial \ell(w)}{\partial w_0}, \dots, \tfrac{\partial \ell(w)}{\partial w_d}\Big]
Update rule (step size \eta > 0): w^{(t+1)} = w^{(t)} + \eta \, \nabla_w \ell(w^{(t)})
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < \epsilon
w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y^j - P(Y=1 \mid x^j, w^{(t)}) \big]
For i = 1, \dots, d:
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y^j - P(Y=1 \mid x^j, w^{(t)}) \big]
repeat

Regularized Conditional Log Likelihood
If the data is linearly separable, the weights go to infinity, which leads to overfitting. Penalize large weights.
Add a regularization penalty, e.g., L_2:
\ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} \|w\|_2^2
Practical note about w_0: the bias term is typically left out of the penalty (note the sum over i > 0 below).

Standard v. Regularized Updates
Maximum conditional likelihood estimate:
w^* = \arg\max_w \ln \prod_j P(y^j \mid x^j, w)
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y^j - P(Y=1 \mid x^j, w^{(t)}) \big]
Regularized maximum conditional likelihood estimate:
w^* = \arg\max_w \Big[ \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} \sum_{i>0} w_i^2 \Big]
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_j x_i^j \big[ y^j - P(Y=1 \mid x^j, w^{(t)}) \big] \Big\}
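The regularized update above translates directly into code. This is a rough sketch under simple assumptions (NumPy, 0/1 labels, a fixed step size, an unregularized bias, and a max-change stopping test), not the exact implementation from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_gradient_ascent(X, y, lam=0.1, eta=0.01, eps=1e-6, max_iter=10000):
    """Batch gradient ascent on the L2-regularized conditional log likelihood.

    X: (N, d), y: (N,) in {0, 1}. The bias w0 is kept unregularized.
    Stops when no weight changes by more than eps in one update.
    """
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(max_iter):
        resid = y - sigmoid(w0 + X @ w)          # y^j - P(Y=1 | x^j, w)
        grad_w = X.T @ resid - lam * w           # regularized gradient for w_i, i > 0
        grad_w0 = resid.sum()                    # no penalty on the bias
        w_new, w0_new = w + eta * grad_w, w0 + eta * grad_w0
        done = max(np.abs(w_new - w).max(), abs(w0_new - w0)) < eps
        w, w0 = w_new, w0_new
        if done:
            break
    return w, w0
```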

Stopping criterion
\ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} \|w\|_2^2
Regularized logistic regression is strongly concave: the negative second derivative is bounded away from zero.
Strong concavity (convexity) is super helpful!!
For example, for strongly concave \ell(w):
\ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2

Convergence rates for gradient descent/ascent
Number of iterations to get to accuracy \ell(w^*) - \ell(w) \le \epsilon:
- If the function is Lipschitz: O(1/\epsilon^2)
- If the gradient of the function is Lipschitz: O(1/\epsilon)
- If the function is strongly convex: O(\ln(1/\epsilon))

Challenge 1: Complexity of computing gradients
What's the cost of a gradient update step for LR?
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_j x_i^j \big[ y^j - P(Y=1 \mid x^j, w^{(t)}) \big] \Big\}
Each full gradient step touches every data point and every feature: O(Nd) per update.

Challenge 2: Data is streaming
Assumption thus far: batch data.
But click prediction is a streaming data task:
- User enters a query, and an ad must be selected: observe x^j, and must predict y^j
- User either clicks or doesn't click on the ad: label y^j is revealed afterwards
- Google gets a reward if the user clicks on the ad
- Weights must be updated for next time

Learning Problems as Expectations
Minimizing loss in training data:
- Given a dataset D = \{x^1, \dots, x^N\}, sampled iid from some distribution p(x) on features
- Loss function, e.g., hinge loss, logistic loss, ...
- We often minimize loss in training data:
  \ell_D(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x^j)
However, we should really minimize the expected loss on all data:
\ell(w) = \mathbb{E}_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
So we are approximating the integral by the average on the training data.

Gradient ascent in Terms of Expectations
True objective function:
\ell(w) = \mathbb{E}_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
Taking the gradient:
\nabla \ell(w) = \mathbb{E}_x[\nabla \ell(w, x)]
True gradient ascent rule:
w^{(t+1)} = w^{(t)} + \eta\, \mathbb{E}_x[\nabla \ell(w^{(t)}, x)]
How do we estimate the expected gradient?
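To make the approximation concrete, here is a small sketch (illustrative function names and synthetic data, not from the lecture) where the training-set average of per-example gradients stands in for the expectation E_x[∇ℓ(w, x)].

```python
import numpy as np

def logistic_loss_grad(w, x, y):
    """Gradient of the per-example logistic loss -ln P(y | x, w) with respect to w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return -(y - p) * x

def empirical_gradient(w, X, Y):
    """Average gradient over the training set: a sample-based estimate of E_x[grad l(w, x)]."""
    return np.mean([logistic_loss_grad(w, x, y) for x, y in zip(X, Y)], axis=0)

# toy check: the training average stabilizes as N grows (synthetic data)
rng = np.random.default_rng(1)
w_true, w = np.array([1.0, -2.0]), np.zeros(2)
for N in (100, 10000):
    X = rng.normal(size=(N, 2))
    Y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
    print(N, empirical_gradient(w, X, Y))
```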

SGD: Stochastic Gradient Ascent (or Descent)
True gradient: \nabla \ell(w) = \mathbb{E}_x[\nabla \ell(w, x)]
Sample-based approximation: \nabla \ell(w) \approx \frac{1}{m} \sum_{j=1}^m \nabla \ell(w, x^j)
What if we estimate the gradient with just one sample???
- Unbiased estimate of the gradient
- Very noisy!
- Called stochastic gradient ascent (or descent), among many other names
- VERY useful in practice!!!

Stochastic Gradient Ascent: general case
Given a stochastic function of parameters \ell(w, x), we want to find the maximum of \ell(w) = \mathbb{E}_x[\ell(w, x)].
Start from w^{(0)} and repeat until convergence:
- Get a sample data point x_t
- Update parameters: w^{(t+1)} \leftarrow w^{(t)} + \eta_t \nabla \ell(w^{(t)}, x_t)
Works in the online learning setting!
Complexity of each gradient step is constant in the number of examples!
In general, the step size changes with iterations.
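The general-case update can be written as a short generic loop. This sketch (the function names and callback structure are mine, not the lecture's) just mirrors the bullets above: sample a point, take a step, let the step size change with t.

```python
import numpy as np

def stochastic_gradient_ascent(grad_fn, sample_fn, w0, step_size_fn, num_steps):
    """Generic stochastic gradient ascent.

    grad_fn(w, x): gradient of l(w, x) at one sample x (an unbiased estimate
    of the true gradient E_x[grad l(w, x)]).
    sample_fn(t): the data point for step t (a random draw, or the next point
    arriving in an online stream).
    step_size_fn(t): step size eta_t, typically decreasing with t.
    """
    w = np.asarray(w0, dtype=float)
    for t in range(num_steps):
        x_t = sample_fn(t)
        w = w + step_size_fn(t) * grad_fn(w, x_t)
    return w
```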

Stochastic Gradient Ascent for Logistic Regression
Logistic loss as a stochastic function:
\mathbb{E}_x[\ell(w, x)] = \mathbb{E}_x\Big[\ln P(y \mid x, w) - \frac{\lambda}{2}\|w\|_2^2\Big]
Batch gradient ascent updates:
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \big[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \big] \Big\}
Stochastic gradient ascent updates (online setting):
w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \Big\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \big] \Big\}

Convergence rate of SGD
Theorem (see Nemirovski et al. '09 from the readings):
Let f be a strongly convex stochastic function, and assume the gradient of f is Lipschitz continuous and bounded.
Then, for step sizes \eta_t \propto 1/t, the expected loss decreases as O(1/t):
\mathbb{E}\big[f(w^{(t)})\big] - f(w^*) = O(1/t)
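Putting the single-sample update into code, here is a sketch for the online setting (NumPy, a 1/t step-size schedule chosen for illustration, bias term omitted for brevity); it is meant to show the update, not to reproduce production click-prediction code.

```python
import numpy as np

def sgd_logistic_regression(stream, d, lam=0.01, eta0=0.5):
    """Stochastic gradient ascent for L2-regularized logistic regression.

    stream: iterable of (x_t, y_t) pairs with x_t a length-d array, y_t in {0, 1}.
    Uses a decaying step size eta_t = eta0 / (1 + t), one example per update.
    """
    w = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream):
        eta_t = eta0 / (1.0 + t)
        p = 1.0 / (1.0 + np.exp(-(w @ x_t)))        # P(Y=1 | x_t, w)
        w += eta_t * (-lam * w + x_t * (y_t - p))   # single-sample update
    return w

# toy usage on a synthetic stream (illustrative only)
rng = np.random.default_rng(2)
w_true = np.array([1.5, -2.0, 0.5])
def make_stream(n):
    for _ in range(n):
        x = rng.normal(size=3)
        y = float(rng.random() < 1.0 / (1.0 + np.exp(-(w_true @ x))))
        yield x, y
print(sgd_logistic_regression(make_stream(5000), d=3))
```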

Convergence rates for gradient descent/ascent versus SGD
Number of iterations to get to accuracy \ell(w^*) - \ell(w) \le \epsilon:
- Gradient descent: if the function is strongly convex, O(\ln(1/\epsilon)) iterations
- Stochastic gradient descent: if the function is strongly convex, O(1/\epsilon) iterations
Seems exponentially worse, but the comparison is much more subtle:
- Total running time, e.g., for logistic regression:
  - Gradient descent: O(Nd) per iteration, so O(Nd \ln(1/\epsilon)) total
  - SGD: O(d) per iteration, so O(d/\epsilon) total
- SGD can win when we have a lot of data
- When analyzing the true error, the situation is even more subtle: the expected running time is about the same (see readings)

Motivating AdaGrad (Duchi, Hazan, Singer 2011)
Assuming w \in \mathbb{R}^d, standard stochastic (sub)gradient descent updates are of the form:
w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\, g_{t,i}
Should all features share the same learning rate?
- We often have high-dimensional feature spaces
- Many features are irrelevant
- Rare features are often very informative
AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.

Why Adapt to Geometry?
[Figure: worked example from Duchi et al.'s ISMP 2012 slides, a small table of labels y_t and sparse feature vectors (x_{t,1}, x_{t,2}, x_{t,3}) in which feature 1 is frequent but irrelevant, while features 2 and 3 are infrequent but predictive of the label.]

Not All Features are Created Equal
Motivating examples:
- Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
- High-dimensional image features
Images from Duchi et al. ISMP 2012 slides.

Projected Gradient
Brief aside: consider an arbitrary feature space w \in \mathcal{W}.
If \mathcal{W} \subseteq \mathbb{R}^d is a constraint set, we can use projected gradient for (sub)gradient descent:
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2

Regret Minimization
How do we assess the performance of an online algorithm?
- The algorithm iteratively predicts w^{(t)}
- It incurs loss f_t(w^{(t)})
Regret: the total incurred loss of the algorithm relative to the best choice of w that could have been made retrospectively:
R(T) = \sum_{t=1}^T f_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^T f_t(w)
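A projection step is easiest to see with a concrete constraint set. This sketch picks the L2 ball purely for illustration (the lecture leaves W generic), and the helper names are mine.

```python
import numpy as np

def project_onto_l2_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : ||w||_2 <= radius},
    used here as a stand-in for a generic constraint set W."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def projected_gradient_step(w, g, eta, project=project_onto_l2_ball):
    """One projected (sub)gradient descent step:
    w_{t+1} = argmin_{w in W} || w - (w_t - eta * g_t) ||_2^2."""
    return project(w - eta * g)

# toy usage: the unconstrained step leaves the unit ball; the projection pulls it back
print(projected_gradient_step(np.array([0.8, 0.0]), np.array([-3.0, -1.0]), eta=0.5))
```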

Regret Bounds for Standard SGD
Standard projected gradient stochastic updates:
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2
Standard regret bound:
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} \big\| w^{(1)} - w^* \big\|_2^2 + \frac{\eta}{2} \sum_{t=1}^T \| g_t \|_2^2

Projected Gradient using Mahalanobis
Standard projected gradient stochastic updates:
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2
What if, instead of an L_2 metric for the projection, we considered the Mahalanobis norm defined by a matrix A \succeq 0?
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, A^{-1} g_t\big) \big\|_A^2

Mahalanobis Regret Bounds
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, A^{-1} g_t\big) \big\|_A^2
What A to choose? The regret bound is now:
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta} \big\| w^{(1)} - w^* \big\|_A^2 + \frac{\eta}{2} \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle
What if we minimize the upper bound on regret with respect to A in hindsight?
\min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle

Mahalanobis Regret Minimization
Objective:
\min_A \sum_{t=1}^T \langle g_t, A^{-1} g_t \rangle \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C
Solution:
A = c \Big( \sum_{t=1}^T g_t g_t^T \Big)^{1/2}
For the proof, see Appendix E, Lemma 5 of Duchi et al. 2011; it uses the trace trick and a Lagrangian argument.
A defines the norm of the metric space we should be operating in.
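To see what the closed-form solution looks like numerically, here is a tiny sketch (using SciPy's matrix square root; the helper name and toy gradients are mine) that builds A = c (Σ_t g_t g_tᵀ)^{1/2} from a list of past gradients.

```python
import numpy as np
from scipy.linalg import sqrtm

def optimal_metric(grads, c=1.0):
    """A = c * (sum_t g_t g_t^T)^(1/2): the hindsight-optimal Mahalanobis metric
    (up to the trace constraint) for the regret bound above."""
    G = sum(np.outer(g, g) for g in grads)
    return c * np.real(sqrtm(G))

# toy usage: axis-aligned gradients with different frequencies give a diagonal A
grads = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
         np.array([0.0, 0.2, 0.0]), np.array([0.0, 0.0, 0.5])]
print(np.round(optimal_metric(grads), 3))
```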

AdaGrad Algorithm
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, A_t^{-1} g_t\big) \big\|_{A_t}^2
At time t, estimate the optimal (sub)gradient modification A by
A_t = \Big( \sum_{\tau=1}^t g_\tau g_\tau^T \Big)^{1/2}
For large d, A_t is computationally intensive to compute. Instead, use only its diagonal:
w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t\big) \big\|_{\mathrm{diag}(A_t)}^2
Then the algorithm is a simple modification of the normal updates.

AdaGrad in Euclidean Space
For \mathcal{W} = \mathbb{R}^d, for each feature dimension i:
w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i}, \quad \text{where } \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}
That is,
w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}\, g_{t,i}
Each feature dimension has its own learning rate!
- Adapts with t
- Takes the geometry of the past observations into account
- The primary role of \eta is determining the rate the first time a feature is encountered
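The diagonal version above amounts to keeping a per-coordinate running sum of squared gradients. This is a minimal sketch (my own function names, a small eps added for numerical safety before a feature is first seen, minimization convention), not the exact pseudocode from Duchi et al.

```python
import numpy as np

def adagrad(grad_fn, sample_fn, w0, eta=0.1, num_steps=1000, eps=1e-8):
    """Diagonal AdaGrad for the unconstrained case (W = R^d).

    grad_fn(w, x) returns the stochastic (sub)gradient at w for sample x.
    Each coordinate's step is scaled by 1 / sqrt(sum of its past squared gradients).
    """
    w = np.asarray(w0, dtype=float)
    sq_grad_sum = np.zeros_like(w)
    for t in range(num_steps):
        g = grad_fn(w, sample_fn(t))              # stochastic gradient at step t
        sq_grad_sum += g ** 2                     # per-feature history
        w -= eta * g / (np.sqrt(sq_grad_sum) + eps)
    return w
```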

AdaGrad Theoretical Guarantees
AdaGrad regret bound:
\sum_{t=1}^T f_t(w^{(t)}) - f_t(w^*) \le 2 R_\infty \sum_{j=1}^d \big\| g_{1:T,\,j} \big\|_2, \quad \text{where } R_\infty := \max_t \| w^{(t)} - w^* \|_\infty
So, what does this mean in practice? Many cool examples. This really is used in practice! Let's just examine one.

AdaGrad Theoretical Example
Expect AdaGrad to out-perform when the gradient vectors are sparse.
SVM hinge loss example:
f_t(w) = \big[ 1 - y_t \langle x_t, w \rangle \big]_+, \quad x_t \in \{-1, 0, 1\}^d
If x_{t,j} \neq 0 with probability \propto 1/j, then AdaGrad achieves
\mathbb{E}\Big[ f\Big( \tfrac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \| w^* \|_\infty \, \frac{\max\{\log d,\ d^{1/2}\}}{\sqrt{T}} \Big)
while the previously best known method achieves
\mathbb{E}\Big[ f\Big( \tfrac{1}{T} \sum_{t=1}^T w^{(t)} \Big) \Big] - f(w^*) = O\Big( \| w^* \|_2 \, \sqrt{\frac{d}{T}} \Big)

Neural Network Learning
Wildly non-convex problem:
f(x; \theta) = \log\Big( 1 + \exp\big( \big\langle [\,p(\langle x_1, \theta_1 \rangle), \dots, p(\langle x_k, \theta_k \rangle)\,],\ \theta_0 \big\rangle \big) \Big), \quad \text{where } p(\alpha) = \frac{1}{1 + \exp(-\alpha)}
Very non-convex problem, but use SGD methods anyway.
[Figure: average frame accuracy (%) on a test set versus training time in hours for SGD, GPU, Downpour SGD, Downpour SGD with AdaGrad, and Sandblaster L-BFGS, from Dean et al. 2012. Distributed training with d = 1.7 x 10^9 parameters; SGD and AdaGrad use 80 machines (1000 cores), L-BFGS uses 800 machines (10000 cores). Images from Duchi et al. ISMP 2012 slides.]

What you should know about Logistic Regression (LR) and Click Prediction
- Click prediction problem:
  - Estimate the probability of clicking
  - Can be modeled as logistic regression
- Logistic regression model: linear model
- Gradient ascent to optimize the conditional likelihood
- Overfitting + regularization
- Regularized optimization
  - Convergence rates and stopping criterion
- Stochastic gradient ascent for large/streaming data
  - Convergence rates of SGD
- AdaGrad motivation, derivation, and algorithm