
Case Study 1: Estimating Click Probabilities
Intro, Logistic Regression, Gradient Descent + SGD
Machine Learning for Big Data CSE547/STAT548, University of Washington
Sham Kakade
April 4, 2017

Announcements:
Project Proposals: due this Friday! One page.
HW1 posted today.
TA office hours (starting NEXT week).
Readings: please do them.

Today:
Review: logistic regression, GD, SGD
Hashing and Sketching

Machine Learning for Big Data (CSE 547 / STAT 548)
(what is big data anyways?)

Ad Placement Strategies
Companies bid on ad prices
Which ad wins? (many simplifications here)
Naively:
But:
Instead:

Learning Problem for Click Prediction
Prediction task:
Features:
Data:
Batch:
Online:
Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
Focus on logistic regression; captures main concepts, ideas generalize to other approaches

Logistic Regression
Learn P(Y | X) directly
Assume a particular functional form
Sigmoid applied to a linear function of the data
Logistic function (or sigmoid): $\sigma(z) = \frac{1}{1 + \exp(-z)}$
Features can be discrete or continuous!
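
To make the functional form concrete, here is a minimal sketch (not from the slides) of the logistic model in Python; the function names and NumPy usage are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_click_proba(w0, w, x):
    # P(Y = 1 | x, w): sigmoid applied to a linear function of the features
    return sigmoid(w0 + np.dot(w, x))

# Example with three continuous features (hypothetical weights)
w0 = -1.0
w = np.array([0.5, -0.2, 1.3])
x = np.array([1.0, 2.0, 0.5])
print(predict_click_proba(w0, w, x))   # estimated click probability
```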

Maximizing Conditional Log Likelihood
$\ell(w) = \ln \prod_j P(y^j \mid x^j, w) = \sum_j \left[ y^j \Big(w_0 + \sum_{i=1}^d w_i x_i^j\Big) - \ln\Big(1 + \exp\big(w_0 + \sum_{i=1}^d w_i x_i^j\big)\Big) \right]$
Good news: $\ell(w)$ is a concave function of $w$, so there are no local-optima problems
Bad news: no closed-form solution to maximize $\ell(w)$
Good news: concave functions are easy to optimize

Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < ε
For i = 1, ..., d:
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \, \frac{\partial \ell(w)}{\partial w_i}\Big|_{w^{(t)}}$
repeat
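
A minimal batch gradient ascent sketch for this objective, assuming NumPy and labels in {0, 1}; the function names, step size, and stopping tolerance are illustrative, not from the slides.

```python
import numpy as np

def log_likelihood_grad(w0, w, X, y):
    # Gradient of the conditional log likelihood.
    # X: (N, d) feature matrix, y: (N,) labels in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y = 1 | x^j, w) for each example
    err = y - p
    return err.sum(), X.T @ err                # d/dw0 and d/dw

def gradient_ascent(X, y, eta=0.1, tol=1e-6, max_iter=1000):
    w0, w = 0.0, np.zeros(X.shape[1])
    for _ in range(max_iter):
        g0, g = log_likelihood_grad(w0, w, X, y)
        w0_new, w_new = w0 + eta * g0, w + eta * g
        # iterate until the change in the weights falls below the tolerance
        if abs(w0_new - w0) + np.linalg.norm(w_new - w) < tol:
            return w0_new, w_new
        w0, w = w0_new, w_new
    return w0, w
```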

Regularized Conditional Log Likelihood
If data are linearly separable, weights go to infinity
Leads to overfitting → penalize large weights
Add a regularization penalty, e.g., L2:
$\ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} \|w\|_2^2$
Practical note about $w_0$:

Standard v. Regularized Updates
Maximum conditional likelihood estimate:
$w^* = \arg\max_w \, \ln \prod_j P(y^j \mid x^j, w)$
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[y^j - P(Y=1 \mid x^j, w^{(t)})\big]$
Regularized maximum conditional likelihood estimate:
$w^* = \arg\max_w \, \ln \Big[\prod_j P(y^j \mid x^j, w)\Big] - \frac{\lambda}{2} \sum_{i > 0} w_i^2$
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \sum_j x_i^j \big[y^j - P(Y=1 \mid x^j, w^{(t)})\big] \Big\}$
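
A sketch of a single L2-regularized gradient ascent step under the same assumptions as above (NumPy, labels in {0, 1}); the regularization strength lam is a placeholder, and the bias is left unpenalized to match the sum over i > 0 in the regularized objective.

```python
import numpy as np

def regularized_grad_step(w0, w, X, y, eta=0.1, lam=1.0):
    # One L2-regularized gradient ascent step; w0 is not penalized.
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
    err = y - p
    w0_new = w0 + eta * err.sum()
    w_new = w + eta * (X.T @ err - lam * w)   # -lam * w comes from the L2 penalty
    return w0_new, w_new
```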

Stopping criterion
$\ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \frac{\lambda}{2} \|w\|_2^2$
Regularized logistic regression is strongly concave
Negative second derivative bounded away from zero
Strong concavity (convexity) is super helpful!!
For example, for strongly concave $\ell(w)$:
$\ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2$

Convergence rates for gradient descent/ascent
Number of iterations to get to accuracy $\ell(w^*) - \ell(w) \le \epsilon$:
If the function $\ell(w)$ is Lipschitz: $O(1/\epsilon^2)$
If the gradient of the function is Lipschitz: $O(1/\epsilon)$
If the function is strongly convex: $O(\ln(1/\epsilon))$
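
The strong concavity bound gives a simple stopping test: once the squared gradient norm divided by 2λ drops below ε, the suboptimality is guaranteed to be below ε. A minimal sketch (illustrative, not from the slides):

```python
import numpy as np

def close_enough(grad, lam, eps):
    # Strong concavity bound: l(w*) - l(w) <= ||grad l(w)||^2 / (2 * lam),
    # so returning True certifies the suboptimality is at most eps.
    return np.dot(grad, grad) / (2.0 * lam) <= eps
```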

Challenge 1: Complexity of computing gradients
What's the cost of a gradient update step for LR???

Challenge 2: Data is streaming
Assumption thus far: batch data
But, click prediction is a streaming data task:
User enters query, and ad must be selected: observe $x^j$, and must predict $y^j$
User either clicks or doesn't click on ad: label $y^j$ is revealed afterwards
Google gets a reward if user clicks on ad
Weights must be updated for next time

Learning Problems as Expectations
Minimizing loss in training data:
Given dataset: sampled iid from some distribution p(x) on features
Loss function, e.g., hinge loss, logistic loss, ...
We often minimize loss in training data:
$\ell_D(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x^j)$
However, we should really minimize expected loss on all data:
$\ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx$
So, we are approximating the integral by the average on the training data

Gradient Ascent in Terms of Expectations
True objective function: $\ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx$
Taking the gradient:
True gradient ascent rule:
How do we estimate the expected gradient?

SGD: Stochastic Gradient Ascent (or Descent)
True gradient: $\nabla \ell(w) = E_x[\nabla \ell(w, x)]$
Sample-based approximation:
What if we estimate the gradient with just one sample???
Unbiased estimate of the gradient
Very noisy!
Called stochastic gradient ascent (or descent), among many other names
VERY useful in practice!!!

Stochastic Gradient Ascent: General Case
Given a stochastic function of parameters: want to find the maximum
Start from $w^{(0)}$
Repeat until convergence:
Get a sample data point $x_t$
Update parameters: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t \, \nabla \ell(w^{(t)}, x_t)$
Works in the online learning setting!
Complexity of each gradient step is constant in the number of examples!
In general, the step size changes with iterations
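
A minimal sketch of this general loop in Python (the logistic regression specialization follows after the next slide); the per-example gradient function, the 1/sqrt(t) step-size schedule, and the names are illustrative assumptions.

```python
import numpy as np

def sgd_ascent(grad_one, X, y, d, eta0=1.0, n_passes=1):
    # Stochastic gradient ascent: each update uses one (noisy, unbiased)
    # per-example gradient, so the cost per step does not depend on N.
    # grad_one(w, x, y) returns the gradient of the per-example objective at w.
    w = np.zeros(d)                          # start from w^(0)
    t = 0
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):           # stream the examples one at a time
            t += 1
            eta_t = eta0 / np.sqrt(t)        # step size shrinks with iterations
            w = w + eta_t * grad_one(w, x_t, y_t)
    return w
```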

Stochastic Gradient Ascent for Logistic Regression
Logistic loss as a stochastic function:
$E_x[\ell(w, x)] = E_x\big[\ln P(y \mid x, w) - \frac{\lambda}{2} \|w\|_2^2\big]$
Batch gradient ascent updates:
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \big[y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)})\big] \Big\}$
Stochastic gradient ascent updates (online setting):
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \Big\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)})\big] \Big\}$

Theorem: Convergence Rate of SGD
(see CSE546 notes and readings)
Let f be a strongly convex stochastic function
Assume the gradient of f is Lipschitz continuous
Then, for appropriately decaying step sizes, the expected loss decreases as $O(1/\sqrt{t})$
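
Putting the stochastic update into code: a sketch of one regularized SGD step for logistic regression, with the bias stored in w[0] and left unregularized; the names and the lam parameter are illustrative assumptions.

```python
import numpy as np

def sgd_logistic_step(w, x_t, y_t, eta_t, lam):
    # One stochastic gradient ascent step on the regularized conditional
    # log likelihood, using a single example (x_t, y_t) with y_t in {0, 1}.
    z = w[0] + np.dot(w[1:], x_t)
    p = 1.0 / (1.0 + np.exp(-z))             # P(Y = 1 | x_t, w)
    err = y_t - p
    w_new = w.copy()
    w_new[0] += eta_t * err                   # bias term, not regularized
    w_new[1:] += eta_t * (err * x_t - lam * w[1:])
    return w_new
```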

Convergence Rates for Gradient Descent/Ascent vs. SGD
Number of iterations to get to accuracy $\ell(w^*) - \ell(w) \le \epsilon$:
Gradient descent: if the function is strongly convex, $O(\ln(1/\epsilon))$ iterations
Stochastic gradient descent: if the function is strongly convex, $O(1/\epsilon)$ iterations
Seems exponentially worse, but it is much more subtle:
Total running time, e.g., for logistic regression:
Gradient descent:
SGD:
SGD can win when we have a lot of data
See readings for more details

What you should know about Logistic Regression (LR) and Click Prediction
Click prediction problem: estimate probability of clicking; can be modeled as logistic regression
Logistic regression model: linear model
Gradient ascent to optimize the conditional likelihood
Overfitting + regularization; regularized optimization
Convergence rates and stopping criterion
Stochastic gradient ascent for large/streaming data
Convergence rates of SGD