Stochastic Gradient Descent


Stochastic Gradient Descent
Machine Learning CSE546
Carlos Guestrin
University of Washington
October 9, 2013

Logistic Regression
- Logistic function (or sigmoid): \sigma(z) = \frac{1}{1 + e^{-z}}
- Learn P(Y|X) directly:
  - Assume a particular functional form for the link function
  - Sigmoid applied to a linear function of the input features:
    P(Y=1 \mid x, w) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i x_i)\big)}
- Features can be discrete or continuous!
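As an illustration (not part of the slides), a minimal NumPy sketch of the logistic model P(Y=1 | x, w); the feature matrix X, weights w, and bias w0 below are made-up example values:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    # P(Y=1 | x, w) = sigmoid(w0 + w . x), applied row-wise to X
    return sigmoid(w0 + X @ w)

# Made-up example: 3 examples with 2 features each
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
w = np.array([1.5, -0.5])
w0 = 0.1
print(predict_proba(X, w, w0))  # probabilities in (0, 1)
```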

Optimizing a concave function: gradient ascent
- The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent.
- Gradient: \nabla_w \ell(w) = \big[ \tfrac{\partial \ell(w)}{\partial w_0}, \dots, \tfrac{\partial \ell(w)}{\partial w_k} \big]
- Step size: \eta > 0
- Update rule: w^{(t+1)} = w^{(t)} + \eta\, \nabla_w \ell(w^{(t)})
- Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent can be much better.

Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < \epsilon
  w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \big]
  For i = 1, \dots, k:
    w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^{(j)} \big[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \big]
  repeat
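A minimal NumPy sketch of the batch gradient-ascent update above (illustrative only: a fixed iteration count replaces the change < ε test, and dividing the step by N just rescales η):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, eta=0.1, iters=1000):
    """X: (N, k) features, y: (N,) labels in {0, 1}."""
    N, k = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])       # prepend 1s so w[0] plays the role of w_0
    w = np.zeros(k + 1)
    for _ in range(iters):
        residual = y - sigmoid(Xb @ w)         # y^(j) - P(Y=1 | x^(j), w^(t))
        w = w + (eta / N) * (Xb.T @ residual)  # sum over all N examples
    return w
```

Note that every single update pass touches all N training examples, which is exactly the cost concern raised on the next slide.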

The Cost, The Cost!!! Think about the cost
- What's the cost of a gradient update step for LR? Each step sums over all N training examples for every one of the k weights, so a single update costs O(Nk).

Learning Problems as Expectations
- Minimizing loss on training data:
  - Given a dataset \{x^1, \dots, x^N\}, sampled i.i.d. from some distribution p(x) on features
  - Loss function, e.g., hinge loss, logistic loss, ...
  - We often minimize loss on the training data:
    \ell_D(w) = \frac{1}{N} \sum_{j=1}^{N} \ell(w, x^j)
- However, we should really minimize the expected loss on all data:
  \ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
- So we are approximating the integral by the average on the training data.
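As a hedged illustration of "approximating the integral by the average" (the distribution p(x), the toy quadratic loss, and the fixed w below are all made up for the demo; they are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x):
    # Toy per-example loss standing in for hinge/logistic loss
    return (w * x - 1.0) ** 2

w = 0.5
# "All data": take p(x) to be standard normal, so the expected loss has a closed form:
# E[(w x - 1)^2] = w^2 E[x^2] - 2 w E[x] + 1 = w^2 + 1
expected_loss = w ** 2 + 1.0

# "Training data": N i.i.d. samples from p(x); the empirical loss is their average
for N in (10, 1_000, 100_000):
    x = rng.standard_normal(N)
    print(f"N={N:>7}: empirical {loss(w, x).mean():.4f}  vs expected {expected_loss:.4f}")
```

As N grows, the training-data average converges to the expectation, which is the justification for minimizing \ell_D(w) in the first place.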

Gradient Ascent in Terms of Expectations
- True objective function: \ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx
- Taking the gradient: \nabla \ell(w) = E_x[\nabla \ell(w, x)]
- True gradient ascent rule: w^{(t+1)} = w^{(t)} + \eta\, E_x[\nabla \ell(w^{(t)}, x)]
- How do we estimate the expected gradient?

SGD: Stochastic Gradient Ascent (or Descent)
- True gradient: \nabla \ell(w) = E_x[\nabla \ell(w, x)]
- Sample-based approximation: \nabla \ell(w) \approx \frac{1}{N} \sum_{j=1}^{N} \nabla \ell(w, x^j)
- What if we estimate the gradient with just one sample???
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
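A small NumPy check (illustrative, not from the slides) that a single-sample gradient is an unbiased but noisy estimate of the full gradient, using the logistic-loss gradient form from the next slide; the synthetic data and weight vector are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Synthetic logistic-regression data and a fixed weight vector w
N, k = 1000, 3
X = rng.standard_normal((N, k))
y = (rng.random(N) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
w = np.zeros(k)

# Full (empirical) gradient of the log-likelihood at w, averaged over all N examples
full_grad = X.T @ (y - sigmoid(X @ w)) / N

# One-sample gradient: unbiased (its expectation over j equals full_grad) but very noisy
j = rng.integers(N)
one_sample = X[j] * (y[j] - sigmoid(X[j] @ w))

# Averaging many one-sample gradients recovers the full gradient
idx = rng.integers(N, size=20_000)
averaged = np.mean(X[idx] * (y[idx] - sigmoid(X[idx] @ w))[:, None], axis=0)

print("full    :", full_grad)
print("one     :", one_sample)
print("averaged:", averaged)
```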

Stochastic Gradient Ascent for Logistic Regression
- Logistic loss as a stochastic function:
  E_x[\ell(w, x)] = E_x\Big[ \ln P(y \mid x, w) - \frac{\lambda}{2} \lVert w \rVert_2^2 \Big]
- Batch gradient ascent updates:
  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \Big\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \big[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \big] \Big\}
- Stochastic gradient ascent updates (online setting):
  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \Big\{ -\lambda w_i^{(t)} + x_i^{(t)} \big[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \big] \Big\}

Stochastic Gradient Ascent: general case
- Given a stochastic function of parameters: \ell(w) = E_x[\ell(w, x)]
- Want to find the maximum
- Start from w^{(0)}
- Repeat until convergence:
  - Get a sample data point x^t
  - Update parameters: w^{(t+1)} \leftarrow w^{(t)} + \eta_t \nabla \ell(w^{(t)}, x^t)
- Works in the online learning setting!
- Complexity of each gradient step is constant in the number of examples!
- In general, the step size changes with iterations.
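A minimal sketch of the stochastic update above in NumPy (assumptions: a constant step size instead of a decaying \eta_t, a made-up default \lambda, the intercept term omitted, and illustrative names throughout):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, lam=0.01, epochs=5, seed=0):
    """X: (N, k) features, y: (N,) labels in {0, 1}; one example per update."""
    rng = np.random.default_rng(seed)
    N, k = X.shape
    w = np.zeros(k)
    for _ in range(epochs):
        for t in rng.permutation(N):                 # visit examples in random order
            pred = sigmoid(X[t] @ w)                 # P(Y=1 | x^(t), w^(t))
            # Regularized stochastic update: -lam*w + x^(t) [y^(t) - P(Y=1 | x^(t), w^(t))]
            w = w + eta * (-lam * w + X[t] * (y[t] - pred))
    return w
```

Each update touches exactly one example, so the cost of a step does not depend on N; in practice the step size \eta_t is decayed over iterations, as the slide notes.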

What you should know
- Classification: predict discrete classes rather than real values
- Logistic regression model:
  - Linear model
  - Logistic function maps real values to [0,1]
- Optimize conditional likelihood
- Gradient computation
- Overfitting
- Regularization
- Regularized optimization
- Cost of a gradient step is high; use stochastic gradient descent

Boosting
Machine Learning CSE546
Carlos Guestrin
University of Washington
October 14, 2013

Fighting the bias-variance tradeoff
- Simple (a.k.a. weak) learners are good
  - e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
  - Low variance, don't usually overfit too badly
- Simple (a.k.a. weak) learners are bad
  - High bias, can't solve hard learning problems
- Can we make weak learners always good???
  - No!!! But often yes...

Voting (Ensemble Methods)
- Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
- Output class: (weighted) vote of each classifier
  - Classifiers that are most sure will vote with more conviction
  - Classifiers will be most sure about a particular part of the space
  - On average, do better than a single classifier!
- But how do you...
  - force classifiers to learn about different parts of the input space?
  - weigh the votes of different classifiers?

Boosting [Schapire, 1989]
- Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
- On each iteration t:
  - weight each training example by how incorrectly it was classified
  - learn a hypothesis h_t
  - and a strength for this hypothesis, \alpha_t
- Final classifier: a weighted vote of the hypotheses, H(x) = \mathrm{sign}\big( \sum_t \alpha_t h_t(x) \big)
- Practically useful
- Theoretically interesting

Learning from weighted data
- Sometimes not all data points are equal
  - Some data points are more equal than others
- Consider a weighted dataset
  - D(j): weight of the j-th training example (x^j, y^j)
  - Interpretations:
    - the j-th training example counts as D(j) examples
    - if I were to resample the data, I would get more samples of heavier data points (see the resampling sketch below)
- Now, in all calculations, whenever used, the j-th training example counts as D(j) examples
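A hedged sketch of the resampling interpretation of D(j) (the toy data and function names are made up; weak learners that accept sample weights directly can skip the resampling):

```python
import numpy as np

def resample_by_weight(X, y, D, rng):
    # Draw N examples with replacement, with probability proportional to D(j):
    # heavier points show up more often, matching "counts as D(j) examples".
    N = len(y)
    idx = rng.choice(N, size=N, replace=True, p=D)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, +1, +1])
D = np.array([0.7, 0.1, 0.1, 0.1])         # the first example is "heavier"
Xr, yr = resample_by_weight(X, y, D, rng)
print(Xr.ravel(), yr)                      # example 0 typically appears several times
```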

AdaBoost
- Initialize weights to the uniform distribution: D_1(j) = 1/N
- For t = 1 ... T:
  - Train weak learner h_t on distribution D_t over the data
  - Choose weight \alpha_t
  - Update weights:
    D_{t+1}(j) = \frac{D_t(j)\, \exp(-\alpha_t y_j h_t(x_j))}{Z_t}
    where Z_t is the normalizer: Z_t = \sum_{j=1}^{N} D_t(j)\, \exp(-\alpha_t y_j h_t(x_j))
- Output the final classifier: H(x) = \mathrm{sign}\big( \sum_{t=1}^{T} \alpha_t h_t(x) \big)
  (a code sketch follows after the next slide)

Picking the Weight of the Weak Learner
- Weigh h_t higher if it did well on the training data (weighted by D_t):
  \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}
  where \epsilon_t is the weighted training error:
  \epsilon_t = \sum_{j=1}^{N} D_t(j)\, \mathbb{1}[h_t(x_j) \neq y_j]
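A compact, self-contained sketch of the algorithm above, using threshold decision stumps as the weak learner (labels assumed in {-1, +1}; the brute-force stump search and all names are illustrative, not the course's reference code):

```python
import numpy as np

def train_stump(X, y, D):
    # Weak learner: exhaustive search for the best threshold stump under weights D
    N, k = X.shape
    best = (np.inf, 0, 0.0, 1)                       # (weighted error, feature, threshold, sign)
    for f in range(k):
        for thr in np.unique(X[:, f]):
            for s in (1, -1):
                pred = s * np.where(X[:, f] <= thr, 1, -1)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, f, thr, s)
    return best

def adaboost(X, y, T=20):
    # y in {-1, +1}; initialize D_1(j) = 1/N
    N = len(y)
    D = np.full(N, 1.0 / N)
    stumps, alphas = [], []
    for _ in range(T):
        err, f, thr, s = train_stump(X, y, D)
        err = max(err, 1e-12)                        # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)        # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = s * np.where(X[:, f] <= thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)            # up-weight the mistakes
        D /= D.sum()                                 # normalize by Z_t
        stumps.append((f, thr, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    # Final classifier: H(x) = sign( sum_t alpha_t h_t(x) )
    scores = sum(a * s * np.where(X[:, f] <= thr, 1, -1)
                 for (f, thr, s), a in zip(stumps, alphas))
    return np.sign(scores)
```

With labels in {-1, +1}, calling predict(X, *adaboost(X, y)) reproduces the weighted vote H(x) from the slide.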

Why choose \alpha_t for hypothesis h_t this way? [Schapire, 1989]
- The training error of the final classifier is bounded by:
  \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[H(x_j) \neq y_j] \;\leq\; \frac{1}{N} \sum_{j=1}^{N} \exp(-y_j f(x_j))
  where f(x) = \sum_t \alpha_t h_t(x) and H(x) = \mathrm{sign}(f(x)).

Why choose \alpha_t for hypothesis h_t this way? (cont.) [Schapire, 1989]
- The training error of the final classifier is bounded by:
  \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[H(x_j) \neq y_j] \;\leq\; \frac{1}{N} \sum_{j=1}^{N} \exp(-y_j f(x_j)) \;=\; \prod_{t=1}^{T} Z_t
  where Z_t = \sum_{j=1}^{N} D_t(j)\, \exp(-\alpha_t y_j h_t(x_j)).
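A short sketch of why both steps of the bound hold, using only the definitions already on these slides:

```latex
% Step 1: pointwise bound. If H(x_j) \ne y_j then y_j f(x_j) \le 0, so
% \exp(-y_j f(x_j)) \ge 1 = \mathbb{1}[H(x_j) \ne y_j]; otherwise the right side is still \ge 0.
\mathbb{1}[H(x_j) \ne y_j] \;\le\; \exp(-y_j f(x_j))

% Step 2: unroll the weight update D_{t+1}(j) = D_t(j)\exp(-\alpha_t y_j h_t(x_j))/Z_t, starting from D_1(j) = 1/N:
D_{T+1}(j) \;=\; \frac{1}{N} \cdot \frac{\exp\!\big(-y_j \sum_{t=1}^{T} \alpha_t h_t(x_j)\big)}{\prod_{t=1}^{T} Z_t}
          \;=\; \frac{\exp(-y_j f(x_j))}{N \prod_{t=1}^{T} Z_t}

% Step 3: since the weights D_{T+1}(j) sum to 1 over j,
\frac{1}{N}\sum_{j=1}^{N} \exp(-y_j f(x_j)) \;=\; \prod_{t=1}^{T} Z_t
```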

Why choose \alpha_t for hypothesis h_t this way? (cont.) [Schapire, 1989]
- The training error of the final classifier is bounded by:
  \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[H(x_j) \neq y_j] \;\leq\; \frac{1}{N} \sum_{j=1}^{N} \exp(-y_j f(x_j)) \;=\; \prod_{t=1}^{T} Z_t
- If we minimize \prod_t Z_t, we minimize our training error.
- We can tighten this bound greedily by choosing \alpha_t and h_t on each iteration to minimize Z_t:
  Z_t = \sum_{j=1}^{N} D_t(j)\, \exp(-\alpha_t y_j h_t(x_j))

Why choose \alpha_t for hypothesis h_t this way? (cont.) [Schapire, 1989]
- We can minimize this bound by choosing \alpha_t on each iteration to minimize Z_t:
  Z_t = \sum_{j=1}^{N} D_t(j)\, \exp(-\alpha_t y_j h_t(x_j))
- For a boolean target function, this is accomplished by [Freund & Schapire '97]:
  \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}
- You'll prove this in your homework! :)
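For reference, an outline of the standard calculation behind that choice (the homework asks for the full argument):

```latex
% Split Z_t by whether h_t classifies example j correctly (y_j h_t(x_j) = +1 or -1):
Z_t = \sum_{j:\, h_t(x_j) = y_j} D_t(j)\, e^{-\alpha_t} \;+\; \sum_{j:\, h_t(x_j) \ne y_j} D_t(j)\, e^{\alpha_t}
    = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}

% Setting dZ_t/d\alpha_t = -(1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 0 gives
e^{2\alpha_t} = \frac{1 - \epsilon_t}{\epsilon_t}
\quad\Longrightarrow\quad
\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}
```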