Classification: Logistic Regression from Data

Classification: Logistic Regression from Data
Machine Learning: Alvin Grissom II, University of Colorado Boulder
Slides adapted from Emily Fox

Reminder: Logistic Regression

P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)}   (1)

P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)}   (2)

Discriminative prediction: p(y \mid x)
Classification uses: ad placement, spam detection
What we didn't talk about is how to learn β from data.
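
To make equations (1) and (2) concrete, here is a minimal sketch of the prediction step in Python (the function name and variables are illustrative, not from the slides); it uses the numerically stable sigmoid form, which is mathematically identical to equation (2).

```python
import numpy as np

def predict_proba(beta, x):
    """Return (P(Y=0|x), P(Y=1|x)) for one example x; beta[0] is the intercept."""
    z = beta[0] + np.dot(beta[1:], x)   # beta_0 + sum_i beta_i x_i
    p1 = 1.0 / (1.0 + np.exp(-z))       # stable form of eq. (2)
    return 1.0 - p1, p1

# Example: a two-feature model
beta = np.array([0.1, 2.0, -1.0])
x = np.array([1.0, 3.0])
print(predict_proba(beta, x))
```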

Logistic Regression: Objective Function

\mathcal{L} \equiv \ln P(Y \mid X, \beta) = \sum_j \ln P(y^{(j)} \mid x^{(j)}, \beta)   (3)

= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]   (4)

The training data (y, x) are fixed. The objective function is a function of β: which values of β give a good objective value?
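
A sketch of equations (3) and (4) as code (names are illustrative; X holds one example per row, without an intercept column): for fixed data, this is a function of β alone.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Conditional log likelihood, eq. (4); beta[0] is the intercept."""
    z = beta[0] + X @ beta[1:]                    # beta_0 + sum_i beta_i x_i, one value per example
    return np.sum(y * z - np.logaddexp(0.0, z))   # logaddexp computes ln(1 + e^z) stably
```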

Convexity

Convex function: it doesn't matter where you start; if you keep walking up the objective, you reach the optimum. How do we know which way is up? The gradient!

Gradient Descent (non-convex)

Goal: optimize the log likelihood with respect to the variables β.

[Figure: the objective plotted against a parameter for a non-convex function. Successive iterates 0, 1, 2, 3 climb to a nearby local optimum, while a better region of the objective, the "Undiscovered Country", is never reached.]

\beta_{ij}^{(l+1)} = \beta_{ij}^{(l)} - \alpha \frac{\partial J}{\partial \beta_{ij}}, \quad \alpha > 0

Luckily, (vanilla) logistic regression is convex.
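
Why convexity saves us is easy to see numerically. A toy sketch (not from the slides): gradient descent on the convex cost J(b) = (b − 3)² reaches the same minimum from any starting point.

```python
def descend(b, alpha=0.1, steps=100):
    """Repeat b <- b - alpha * dJ/db for J(b) = (b - 3)**2."""
    for _ in range(steps):
        b = b - alpha * 2 * (b - 3)
    return b

print(descend(-10.0), descend(50.0))  # both converge to 3.0
```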

Gradient for Logistic Regression

To ease notation, let's define

\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}   (5)

Our objective function is

\mathcal{L} = \sum_i \mathcal{L}_i = \sum_i \log p(y_i \mid x_i) = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases}   (6)

Taking the Derivative

Apply the chain rule:

\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ -\frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases}   (7)

If we plug in the derivative

\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{ij},   (8)

we can merge these two cases:

\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i (y_i - \pi_i) x_{ij}.   (9)
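
Equation (8) is easy to sanity-check with a finite difference. An illustrative sketch (not from the slides): perturb one coordinate of β and compare the numerical slope of π with the analytic value π(1 − π)x_j.

```python
import numpy as np

def pi(beta, x):
    z = np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -0.5])        # x[0] = 1 plays the role of the intercept feature
j, eps = 1, 1e-6
beta_plus = beta.copy()
beta_plus[j] += eps
numeric = (pi(beta_plus, x) - pi(beta, x)) / eps
analytic = pi(beta, x) * (1 - pi(beta, x)) * x[j]   # eq. (8)
print(numeric, analytic)              # the two values agree closely
```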

Gradient for Logistic Regression

Gradient:

\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \dots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right]   (10)

Update:

\Delta \beta = \eta \nabla_\beta \mathcal{L}(\beta)   (11)

\beta_i \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i}   (12)

η is the step size; it must be greater than zero.
Why are we adding? Because we are maximizing the log likelihood, we walk up the gradient (ascent). What would we do if we wanted to do descent?
NB: Conjugate gradient is usually better, but harder to implement.
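
A sketch of one full-batch update, equations (9)-(12), with illustrative names (X includes a leading column of ones so that beta[0] is the intercept):

```python
import numpy as np

def gradient_ascent_step(beta, X, y, eta):
    """One update beta <- beta + eta * grad, eq. (12)."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # pi_i for every example, eq. (5)
    grad = X.T @ (y - pi)                    # eq. (9), summed over all examples
    return beta + eta * grad                 # adding, because we maximize
```

To do descent on a cost instead, we would subtract the gradient.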

Choosing Step Size

[Figure: a sequence of plots of the objective against the parameter, illustrating how the choice of step size affects the path the updates take toward the optimum.]

Remaining Issues

When do we stop? What if β keeps getting bigger?

Regularized Conditional Log Likelihood

Unregularized:

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta)   (13)

Regularized:

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2   (14)

µ is a regularization parameter that trades off between likelihood and having small parameters.
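
A sketch of the regularized objective (14) in code (illustrative names; setting mu = 0 recovers the unregularized objective (13)):

```python
import numpy as np

def regularized_objective(beta, X, y, mu):
    """Log likelihood minus mu times the squared L2 norm of beta, eq. (14)."""
    z = X @ beta                                 # X includes an intercept column of ones
    ll = np.sum(y * z - np.logaddexp(0.0, z))    # conditional log likelihood
    return ll - mu * np.sum(beta ** 2)           # penalty keeps parameters small
```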

Approximating the Gradient

Our datasets are big (too big to fit into memory)... or the data are changing / streaming.

It is hard to compute the true gradient,

\nabla \mathcal{L}(\beta) = \mathbb{E}_x \left[ \nabla \mathcal{L}(\beta, x) \right],   (15)

an average over all observations. What if we compute an update just from one observation?
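
Why a single observation is a reasonable substitute: the per-example gradient is an unbiased estimate of the average in equation (15). An illustrative check (data and names invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 fake examples, 3 features
y = (rng.random(1000) < 0.5).astype(float)
beta = np.zeros(3)

pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
per_example = (y - pi)[:, None] * X            # one gradient per observation
print(per_example.mean(axis=0))                # average of per-example gradients...
print(X.T @ (y - pi) / len(y))                 # ...equals the (scaled) full gradient
```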

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.

Stochastic Gradient for Logistic Regression

Given a single observation x_i chosen at random from the dataset,

\beta_j \leftarrow \beta_j + \eta \left( -2\mu \beta_j + x_{ij} \left[ y_i - \pi_i \right] \right)   (16)

Examples in class.

Stochastic Gradient for Regularized Regression

\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2   (17)

Taking the derivative (with respect to example x_i):

\frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_{ij} - 2\mu \beta_j   (18)

Algorithm

1. Initialize a vector β to be all zeros.
2. For t = 1, ..., T:
   For each example (x_i, y_i) and feature j:
     Compute π_i ≡ Pr(y_i = 1 | x_i)
     Set β[j] = β[j] + λ(y_i − π_i)x_{ij}
3. Output the parameters β_1, ..., β_d.
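
A runnable sketch of the boxed algorithm (vectorized over features; the regularized update of eq. (18) is folded in via mu, and mu = 0 recovers the update shown in the box):

```python
import numpy as np

def sgd_logistic(X, y, T=10, lam=0.1, mu=0.0):
    """Stochastic gradient for logistic regression; lam is the step size."""
    n, d = X.shape
    beta = np.zeros(d)                                  # step 1: all zeros
    for _ in range(T):                                  # step 2: T passes over the data
        for i in range(n):
            pi_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))            # Pr(y_i = 1 | x_i)
            beta += lam * ((y[i] - pi_i) * X[i] - 2 * mu * beta)   # eq. (18)
    return beta                                         # step 3: output the parameters
```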

Proofs about Stochastic Gradient

Convergence guarantees depend on the convexity of the objective and on how close (within ε) you want to get to the actual answer. The best bounds depend on changing η over time and per dimension (not all features are created equal).
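
One example of a per-dimension, decaying step size in the spirit of that remark is an AdaGrad-style update (an assumption for illustration; the slides do not prescribe a specific scheme): each coordinate's effective step shrinks with its accumulated squared gradients.

```python
import numpy as np

def adagrad_step(beta, grad, cache, eta0=0.1, eps=1e-8):
    """Ascent step with a per-dimension step size eta0 / sqrt(cache)."""
    cache = cache + grad ** 2                      # per-dimension gradient history
    return beta + eta0 * grad / (np.sqrt(cache) + eps), cache
```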

In Class

Your questions!
Working through a simple example.
Getting prepared for the logistic regression homework.