Classification: Logistic Regression from Data

Classification: Logistic Regression from Data. Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder. Lecture 3. Slides adapted from Emily Fox.

Reminder: Logistic Regression

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left[\beta_0 + \sum_i \beta_i x_i\right]} \quad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left[\beta_0 + \sum_i \beta_i x_i\right]}{1 + \exp\left[\beta_0 + \sum_i \beta_i x_i\right]} \quad (2)$$

Discriminative prediction: p(y | x). Classification uses: ad placement, spam detection. What we didn't talk about is how to learn β from data.
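As a concreteness check, here is a minimal sketch (not from the slides) of equation (2) as code; the weights and input below are made-up values:

```python
import math

def predict_prob(beta0, beta, x):
    """P(Y = 1 | x) from equation (2): exp(z) / (1 + exp(z)), z = beta0 + sum_i beta_i * x_i."""
    z = beta0 + sum(b_i * x_i for b_i, x_i in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical weights and a two-feature input
print(predict_prob(beta0=-1.0, beta=[2.0, -0.5], x=[1.0, 3.0]))  # ~0.378
```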

Logistic Regression: Objective Function

$$\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)$$

$$= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left[ \beta_0 + \sum_i \beta_i x_i^{(j)} \right] \right) \right] \quad (4)$$
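A small sketch of equation (4), assuming the data arrive as (feature vector, 0/1 label) pairs; the two training points in the usage line are invented:

```python
import math

def log_likelihood(beta0, beta, data):
    """Conditional log likelihood of equation (4) over (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        z = beta0 + sum(b_i * x_i for b_i, x_i in zip(beta, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

print(log_likelihood(0.0, [1.0], [([2.0], 1), ([-1.0], 0)]))  # ~ -0.44
```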

Convexity

With a convex objective, it doesn't matter where you start: if you keep walking up the objective, you reach the global optimum. How do we know which way is up? The gradient!

Gradient Descent (non-convex)

Goal: optimize the log likelihood with respect to the variables W and b.

[Figure: objective vs. parameter, tracing successive steps 0, 1, 2, 3 downhill into a local optimum; the unexplored region beyond is the "undiscovered country."]

Update rule:

$$W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$$
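A toy sketch (my own, not from the slides) of the update rule above on a made-up non-convex objective J(w) = w⁴ − 3w², showing what the figure traced: the starting point decides which minimum you land in.

```python
def descend(w, alpha=0.01, steps=200):
    """Repeatedly apply w <- w - alpha * dJ/dw for J(w) = w**4 - 3 * w**2."""
    for _ in range(steps):
        w -= alpha * (4 * w**3 - 6 * w)  # gradient of the toy objective
    return w

print(descend(2.0))   # ->  ~1.2247: one local minimum
print(descend(-2.0))  # -> ~-1.2247: a different minimum; the start matters
```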

Gradient for Logistic Regression

Gradient:

$$\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right] \quad (5)$$

Update:

$$\Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\beta) \quad (6)$$

$$\beta_i \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i} \quad (7)$$

η is the step size; it must be greater than zero. Why are we adding? Because we are maximizing the likelihood, so we walk uphill. What would we do if we wanted to do descent? NB: conjugate gradient is usually better, but harder to implement.
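A sketch of one full-batch update per equations (5)-(7), using the standard logistic-regression gradient ∂ℒ/∂βᵢ = Σⱼ xᵢ⁽ʲ⁾(y⁽ʲ⁾ − p⁽ʲ⁾); padding each x with a leading 1 so that β₀ rides along is my convention, and the dataset is invented:

```python
import math

def ascent_step(beta, data, eta=0.1):
    """One update beta_i <- beta_i + eta * dL/dbeta_i (equation 7).
    beta[0] is the intercept, so each x carries a leading 1."""
    grad = [0.0] * len(beta)
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        for i, xi in enumerate(x):
            grad[i] += (y - p) * xi  # one term of the log-likelihood gradient
    return [b + eta * g for b, g in zip(beta, grad)]

# Invented two-point dataset; the likelihood rises with every step
data = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]
beta = [0.0, 0.0]
for _ in range(100):
    beta = ascent_step(beta, data)
```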

Choosing Step Size

[Figure: objective vs. parameter, showing how different step sizes move along the objective.]
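In place of the figures, a hypothetical demonstration on J(w) = w² of what they were showing: too small a step barely moves, a moderate step converges, and too large a step overshoots and diverges.

```python
def run(eta, w=1.0, steps=20):
    """Gradient descent on J(w) = w**2 (gradient 2w) with step size eta."""
    for _ in range(steps):
        w -= eta * 2 * w
    return w

print(run(0.01))  # ~0.67: too small, barely moves
print(run(0.4))   # ~0.0:  converges
print(run(1.1))   # ~38:   too large, overshoots and diverges
```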

Remaining issues

When to stop? What if β keeps getting bigger?

Regularized Conditional Log Likelihood

Unregularized:

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (8)$$

Regularized:

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (9)$$

μ is the regularization parameter that trades off between likelihood and having small parameters.
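A sketch of equation (9) as code, reusing the same (x, y) data layout as above. Penalizing the intercept along with the other weights follows the slide's formula, though in practice β₀ is often left unpenalized:

```python
import math

def regularized_objective(beta, data, mu):
    """Equation (9): conditional log likelihood minus mu * sum_i beta_i**2.
    beta[0] is the intercept; each x carries a leading 1."""
    ll = 0.0
    for x, y in data:
        z = sum(b * xi for b, xi in zip(beta, x))
        ll += y * z - math.log(1.0 + math.exp(z))
    return ll - mu * sum(b * b for b in beta)
```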

Approximating the Gradient

Our datasets are big (too big to fit into memory)... or the data are changing / streaming. That makes the true gradient hard to compute:

$$\nabla \mathcal{L}(\beta) = \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right] \quad (10)$$

an average over all observations. What if we compute an update just from one observation?

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.

Stochastic Gradient for Logistic Regression

Given a single observation x chosen at random from the dataset,

$$\beta_i \leftarrow \beta_i + \eta \left[ -\mu \beta_i + x_i \left( y - p(y = 1 \mid x, \beta) \right) \right] \quad (11)$$

Examples in class.
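A sketch of the single-observation update of equation (11); the sampling loop, hyperparameter values, and two-point dataset are illustrative:

```python
import math
import random

def sgd_step(beta, x, y, eta=0.1, mu=0.01):
    """Equation (11): stochastic update from one (x, y), with L2 regularization."""
    p = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
    return [b + eta * (-mu * b + xi * (y - p)) for b, xi in zip(beta, x)]

data = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]  # invented; leading 1 for beta_0
beta = [0.0, 0.0]
for _ in range(1000):
    x, y = random.choice(data)  # one observation chosen at random
    beta = sgd_step(beta, x, y)
```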

Stochastic Gradient for Regularized Regression

$$\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (12)$$

Taking the derivative (for a single example $x_i$):

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \left( y_i - p(y_i = 1 \mid x_i; \beta) \right) x_j - 2\mu \beta_j \quad (13)$$

Proofs about Stochastic Gradient

Convergence depends on the convexity of the objective and how close (ε) you want to get to the actual answer. The best bounds depend on changing η over time and per dimension (not all features are created equal).

In class: Your questions! Working through a simple example. Getting prepared for the logistic regression homework.