Ad Placement Strategies

Size: px
Start display at page:

Download "Ad Placement Strategies"

Transcription

1 Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad Placement Strategies Companies bid on ad prices Which ad wins? (many simplifications here) Naively: But: Instead:

2 Key Task: Estimating Click Probabilities What is the probability that user i will click on ad j Not important just for ads: Optimize search results Suggest news articles Recommend products Methods much more general, useful for: Classification Regression Density estimation 3 Learning Problem for Click Prediction Prediction task: Features: Data: Batch: Online: Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ) Focus on logistic regression; captures main concepts, ideas generalize to other approaches 4

3 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular functional form Sigmoid applied to a linear function of the data: Z Features can be discrete or continuous! 5 Very convenient! 0 implies 0 linear classification rule! 6 3

4 Digression: Logistic regression more generally Logistic regression in more general case, where Y in {y,,y R } for k<r for k=r (normalization, so no weights for this class) Features can be discrete or continuous! 7 Loss function: Conditional Likelihood Have a bunch of iid data of the form: Discriminative (logistic regression) loss function: Conditional Data Likelihood 8 4

5 Expressing Conditional Log Likelihood `(w) = X j = X j y j ln P (Y = x j, w)+( y j )lnp(y =0 x j, w)! dx dx y j (w 0 + w i x j i ) ln +exp(w 0 + w i x j i ) i= i= 9 Maximizing Conditional Log Likelihood = X j! dx dx y j (w 0 + w i x j i ) ln +exp(w 0 + w i x j i ) i= i= Good news: l(w) is concave function of w, no local optima problems Bad news: no closed-form solution to maximize l(w) Good news: concave functions easy to optimize 0 5

6 Optimizing concave function Gradient ascent Conditional likelihood for Logistic Regression is concave. Find optimum with gradient ascent Gradient: Step size, η>0 Update rule: Gradient ascent is simplest of optimization approaches e.g., Conjugate gradient ascent much better (see reading) Gradient Ascent for LR Gradient ascent algorithm: iterate until change < ε (t) For i =,,d, (t) repeat 6

7 Regularized Conditional Log Likelihood If data is linearly separable, weights go to infinity Leads to overfitting à Penalize large weights Add regularization penalty, e.g., L : `(w) =ln Y j P (y j x j, w)) w Practical note about w 0 : 3 Standard v. Regularized Updates Maximum conditional likelihood estimate (t) Regularized maximum conditional likelihood estimate 3 w = arg max ln 4 Y X P (y j x j, w)) 5 wi w j i>0 (t) 4 7

8 Stopping criterion `(w) =ln Y j P (y j x j, w)) w Regularized logistic regression is strongly concave Negative second derivative bounded away from zero: Strong concavity (convexity) is super helpful!! For example, for strongly concave l(w): `(w ) `(w) apple r`(w) 5 Convergence rates for gradient descent/ascent Number of Iterations to get to accuracy `(w ) If func Lipschitz: O(/ϵ ) `(w) apple If gradient of func Lipschitz: O(/ϵ) If func is strongly convex: O(ln(/ϵ)) 6 8

9 Challenge : Complexity of computing gradients What s the cost of a gradient update step for LR??? (t) 7 Challenge : Data is streaming Assumption thus far: Batch data But, click prediction is a streaming data task: User enters query, and ad must be selected: Observe x j, and must predict y j User either clicks or doesn t click on ad: Label y j is revealed afterwards Google gets a reward if user clicks on ad Weights must be updated for next time: 8 9

10 Learning Problems as Expectations Minimizing loss in training data: Given dataset: Sampled iid from some distribution p(x) on features: Loss function, e.g., hinge loss, logistic loss, We often minimize loss in training data: `D(w) = N NX `(w, x j ) j= However, we should really minimize expected loss on all data: Z `(w) =E x [`(w, x)] = p(x)`(w, x)dx So, we are approximating the integral by the average on the training data 9 Gradient ascent in Terms of Expectations True objective function: `(w) =E x [`(w, x)] = Z p(x)`(w, x)dx Taking the gradient: True gradient ascent rule: How do we estimate expected gradient? 0 0

11 SGD: Stochastic Gradient Ascent (or Descent) True gradient: r`(w) =E x [r`(w, x)] Sample based approximation: What if we estimate gradient with just one sample??? Unbiased estimate of gradient Very noisy! Called stochastic gradient ascent (or descent) Among many other names VERY useful in practice!!! Stochastic Gradient Ascent: general case Given a stochastic function of parameters: Want to find maximum Start from w (0) Repeat until convergence: Get a sample data point x t Update parameters: Works on the online learning setting! Complexity of each gradient step is constant in number of examples! In general, step size changes with iterations

12 Stochastic Gradient Ascent for Logistic Regression Logistic loss as a stochastic function: E x [`(w, x)] = E x ln P (y x, w) Batch gradient ascent updates: w (t+) i w (t) i 8 < + : w(t) i + N Stochastic gradient ascent updates: Online setting: n w (t+) i w (t) i + t w (t) i NX j= x (j) w 9 = i [y (j) P (Y = x (j), w (t) )] ; o + x (t) i [y (t) P (Y = x (t), w (t) )] 3 Convergence rate of SGD Theorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded Then, for step sizes: The expected loss decreases as O(/t): 4

13 Convergence rates for gradient descent/ascent versus SGD Number of Iterations to get to accuracy `(w ) `(w) apple Gradient descent: If func is strongly convex: O(ln(/ϵ)) iterations Stochastic gradient descent: If func is strongly convex: O(/ϵ) iterations Seems exponentially worse, but much more subtle: Total running time, e.g., for logistic regression: Gradient descent: SGD: SGD can win when we have a lot of data And, when analyzing true error, situation even more subtle expected running time about the same, see readings 5 Motivating AdaGrad (Duchi, Hazan, Singer 0) w R d Assuming, standard stochastic (sub)gradient descent updates are of the form: g t,i w (t+) i w (t) i Should all features share the same learning rate? Often have high-dimensional feature spaces Many features are irrelevant Rare features are often very informative Adagrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations 6 3

14 Why Adapt to Geometry? Why adapt to geometry? Hard Hard x x x y t y t t, t, t, t, t,3 t, Examples from Duchi et al. ISMP 0 slides Nice Nice Frequent, Frequent, irrelevant irrelevant Infrequent, Infrequent, predictive predictive 3 Infrequent, 3 Infrequent, predictive predictive 7 Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 0 8 / 3 Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 0 8 / 3 Not All Features are Created Equal Motivation Examples: Text data: The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 94 photocopier. a High-dimensional image features a The Atlantic, July/August 00. Images from Duchi et al. ISMP 0 slides Duchi et al. (UC Berkeley) Adaptive Emily Subgradient Fox 04 Methods ISMP 0 3 / 3 8 4

15 Projected Gradient w (t+) i w (t) i g t,i Brief aside Consider an arbitrary feature space w W w W If, can use projected gradient for (sub)gradient descent w (t+) = 9 Regret Minimization How do we assess the performance of an online algorithm? Algorithm iteratively predicts w (t) Incur loss f t (w (t) ) Regret: What is the total incurred loss of algorithm relative to the best choice of that could have been made retrospectively w R(T )= TX f t (w (t) ) inf ww TX f t (w) 30 5

16 Regret Bounds for Standard SGD Standard projected gradient stochastic updates: w (t+) = arg min ww w (w(t) g t ) Standard regret bound: TX f t (w (t) ) f t (w ) apple w() w + TX g t 3 Projected Gradient using Mahalanobis Standard projected gradient stochastic updates: w (t+) = arg min ww w (w(t) g t ) What if instead of an L metric for projection, we considered the Mahalanobis norm w (t+) = arg min ww w (w(t) A g t ) A 3 6

17 Mahalanobis Regret Bounds w (t+) = arg min ww w (w(t) A g t ) A What A to choose? TX Regret bound now: f t (w (t) ) f t (w ) apple w() w A + What if we minimize upper bound on regret w.r.t. A in hindsight? min A TX gt,a g t TX g t A 33 Mahalanobis Regret Minimization Objective: min A Solution: TX gt,a g t A = c TX g t gt T! subject to A 0, tr(a) apple C For proof, see Appendix E, Lemma 5 of Duchi et al. 0. Uses trace trick and Lagrangian. A defines the norm of the metric space we should be operating in 34 7

18 AdaGrad Algorithm w (t+) = arg min ww w (w(t) A g t ) A At time t, estimate optimal (sub)gradient modification A by A t = tx g g T =! For d large, A t is computationally intensive to compute. Instead, Then, algorithm is a simple modification of normal updates: w (t+) = arg min ww w (w(t) diag(a t ) g t ) diag(a t ) 35 AdaGrad in Euclidean Space For W = R d, For each feature dimension, where w (t+) i w (t) i t,i = t,i g t,i That is, Each feature dimension has it s own learning rate! Adapts with t w (t+) i w (t) i q Pt = g,i g t,i Takes geometry of the past observations into account Primary role of η is determining rate the first time a feature is encountered 36 8

19 AdaGrad Theoretical Guarantees AdaGrad regret bound: TX dx f t (w (t) ) f t (w ) apple R g :T,j i= R := max w (t) w t So, what does this mean in practice? Many cool examples. This really is used in practice! Let s just examine one 37 AdaGrad Theoretical Example Expect to out-perform when gradient vectors are sparse SVM hinge loss example: f t (w) =[ y t x t, w ] + x t {, 0, } d where E " If x t j 0 with probability / j, > f T!# TX w (t) " Previously best!# known method: TX E f w (t) f(w )=O T w f(w )=O p max{log d, d / } T w p T pd 38 9

20 Neural Network Learning Neural Network Neural Network LearningLearning Neural Network Learning Wildly non-convex problem: n f (x; ) = log ( + exp (h[p(hx, i) Wildly non-convex problem: problem: Wildly non-convex Very non-convex problem, but use SGD methods anyway p(hxk, k i)], 0 i)) f (x; ) = log ( + exp (h[p(hx, i) p(hxk, k i)], 0 i)) f (x; ) = log ( + exp (h[p(hx where, i) p(hxk, k i)], 0 i)) Neural Network Learning p( ) = Accuracy on Test Set where Average Frame Accuracy (%) 5 + exp( ) p(hx, i) where p( ) = 0 + exp( ) 5 p( ) = 0 + exp( ) i) x4 xduchixetal. (UCxBerkeley) 3 p(hx, i) SGD GPU Downpour SGD Downpour SGD w/adagrad Sandblaster L BFGS p(hx, x5 x x x3 x4 x Adaptive Subgradient Methods x 4 x ISMP 0 x3 x4 x / 3 0 Time (hours) (Dean etadaptive al. 0) Subgradient Methods Duchi et al. (UC Berkeley) ISMP 0 5 / 3 Images from Duchi et Distributed, d =.7 0 parameters. SGD and AdaGrad use 80 Duchi et al. (UC Berkeley) ISMP 0 machines (000 cores), L-BFGS uses 800 (0000Adaptive cores)subgradient Methods al. ISMP 0 slides 9 Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 0 6 / 3 5 / 3 39 What you should know about Logistic Regression (LR) and Click Prediction n Click prediction problem: n n n n Estimate probability of clicking Can be modeled as logistic regression Logistic regression model: Linear model Gradient ascent to optimize conditional likelihood Overfitting + regularization Regularized optimization Convergence n Stochastic gradient ascent for large/streaming data Convergence n rates and stopping criterion rates of SGD AdaGrad motivation, derivation, and algorithm 40 0

Convergence rate of SGD

Convergence rate of SGD Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected

More information

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

Adaptive Gradient Methods AdaGrad / Adam

Adaptive Gradient Methods AdaGrad / Adam Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

Warm up. Regrade requests submitted directly in Gradescope, do not  instructors. Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Notes on AdaGrad. Joseph Perla 2014

Notes on AdaGrad. Joseph Perla 2014 Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features

More information

Logistic Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Logistic Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com Logistic Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these

More information

Classification Logistic Regression

Classification Logistic Regression Classification Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 16, 2016 1 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS 2 Weather prediction

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Classification: Logistic Regression from Data

Classification: Logistic Regression from Data Classification: Logistic Regression from Data Machine Learning: Jordan Boyd-Graber University of Colorado Boulder LECTURE 3 Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber Boulder Classification:

More information

Is the test error unbiased for these programs?

Is the test error unbiased for these programs? Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland LOGISTIC REGRESSION FROM TEXT Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber UMD Introduction

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Classification: Logistic Regression from Data

Classification: Logistic Regression from Data Classification: Logistic Regression from Data Machine Learning: Alvin Grissom II University of Colorado Boulder Slides adapted from Emily Fox Machine Learning: Alvin Grissom II Boulder Classification:

More information

Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression. COMP 527 Danushka Bollegala Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

More information

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements Project proposal due next week: Tuesday 10/24 Still looking for people to work on deep learning Phytolith project, join #phytolith slack channel 2017 Kevin Jamieson 1 Gradient Descent Machine

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Linear classifiers: Logistic regression

Linear classifiers: Logistic regression Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The

More information

Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Machine Learning Gaussian Naïve Bayes Big Picture

Machine Learning Gaussian Naïve Bayes Big Picture Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 27, 2011 Today: Naïve Bayes Big Picture Logistic regression Gradient ascent Generative discriminative

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Logistic Regression. Stochastic Gradient Descent

Logistic Regression. Stochastic Gradient Descent Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Classification Logistic Regression

Classification Logistic Regression O due Thursday µtwl Classification Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 16, 2016 1 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt)

Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Nathan Schneider (some slides borrowed from Chris Dyer) ENLP 12 February 2018 23 Outline Words, probabilities Features,

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Regression with Numerical Optimization. Logistic

Regression with Numerical Optimization. Logistic CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Classification Based on Probability

Classification Based on Probability Logistic Regression These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton s slides and grateful acknowledgement to the many others who made their course materials freely

More information

Is the test error unbiased for these programs? 2017 Kevin Jamieson

Is the test error unbiased for these programs? 2017 Kevin Jamieson Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 The Regression Problem Training data: A set of input-output

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012) Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

CSC 411: Lecture 04: Logistic Regression

CSC 411: Lecture 04: Logistic Regression CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information