Logistic Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com


Logistic Regression. These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Classification Based on Probability. Instead of just predicting the class, give the probability of the instance being that class, i.e., learn p(y | x). Comparison to the perceptron: the perceptron doesn't produce a probability estimate; the perceptron (like other discriminative classifiers) is only interested in producing a discriminative model. Recall that 0 ≤ p(event) ≤ 1 and p(event) + p(¬event) = 1.

Logistic Regression. Takes a probabilistic approach to learning discriminative functions (i.e., a classifier). We want h_θ(x) to give p(y = 1 | x; θ), so we need 0 ≤ h_θ(x) ≤ 1; we can't just use linear regression with a threshold. Logistic regression model: h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{-z}) is the logistic / sigmoid function, so h_θ(x) = 1 / (1 + e^{-θ^T x}).
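To make the model concrete, here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis; the function and variable names (sigmoid, hypothesis, theta) are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic / sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); always returns a value in (0, 1)."""
    return sigmoid(np.dot(theta, x))

# Example: theta = [0, 1, 2], x = [1, x_1, x_2] (leading 1 is the bias feature)
theta = np.array([0.0, 1.0, 2.0])
x = np.array([1.0, 0.5, -0.25])
print(hypothesis(theta, x))  # an estimate of p(y = 1 | x; theta)
```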

Interpretation of Hypothesis Output. h_θ(x) = estimated p(y = 1 | x; θ). Example: cancer diagnosis from tumor size, with x = [x_0, x_1] = [1, tumorSize]. If h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant. Note that p(y = 0 | x; θ) + p(y = 1 | x; θ) = 1, therefore p(y = 0 | x; θ) = 1 − p(y = 1 | x; θ). Based on example by Andrew Ng

Another Interpretation. Equivalently, logistic regression assumes that log [ p(y = 1 | x; θ) / p(y = 0 | x; θ) ] = θ_0 + θ_1 x_1 + ... + θ_d x_d, where the left-hand side is the log odds of y = 1. Side note: the odds in favor of an event is the quantity p / (1 − p), where p is the probability of the event. E.g., if I toss a fair die, what are the odds that I will get a 6? (p = 1/6, so the odds are (1/6) / (5/6) = 1/5.) In other words, logistic regression assumes that the log odds is a linear function of x. Based on slide by Xiaoli Fern
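The log-odds form follows directly from the sigmoid model introduced above; a short derivation in the same notation:

```latex
\frac{p(y=1 \mid x;\theta)}{p(y=0 \mid x;\theta)}
  = \frac{g(\theta^T x)}{1 - g(\theta^T x)}
  = \frac{1/(1+e^{-\theta^T x})}{e^{-\theta^T x}/(1+e^{-\theta^T x})}
  = e^{\theta^T x}
\;\Longrightarrow\;
\log \frac{p(y=1 \mid x;\theta)}{p(y=0 \mid x;\theta)}
  = \theta^T x
  = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d
```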

Logistic Regression. h_θ(x) = g(θ^T x), with g(z) = 1 / (1 + e^{-z}). θ^T x should take large negative values for negative instances and large positive values for positive instances. Assume a threshold and: predict y = 1 if h_θ(x) ≥ 0.5; predict y = 0 if h_θ(x) < 0.5. Based on slide by Andrew Ng

Non-Linear Decision Boundary. Can apply basis function expansion to the features, same as with linear regression: x = [1, x_1, x_2] → [1, x_1, x_2, x_1 x_2, x_1^2, x_2^2, x_1^2 x_2, x_1 x_2^2, ...]
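A minimal sketch of such a quadratic expansion in NumPy (the helper name expand_features and the exact set of terms kept are illustrative assumptions):

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to a quadratic basis expansion
    [1, x1, x2, x1*x2, x1^2, x2^2]."""
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# The model theta^T phi(x) is still linear in theta, but its decision
# boundary in the original (x1, x2) space can now be curved.
print(expand_features(0.5, -1.0))
```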

Logistic Regression. Given { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n)) }, where x^(i) ∈ R^d and y^(i) ∈ {0, 1}. Model: h_θ(x) = g(θ^T x), with g(z) = 1 / (1 + e^{-z}), θ = [θ_0, θ_1, ..., θ_d], and x = [1, x_1, ..., x_d].

Logistic Regression Objective Function. Can't just use squared loss as in linear regression: J(θ) = (1 / 2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))^2. Using the logistic regression model h_θ(x) = 1 / (1 + e^{-θ^T x}) results in a non-convex optimization.

Deriving the Cost Function via Maximum Likelihood Estimation. The likelihood of the data is given by: l(θ) = Π_{i=1}^n p(y^(i) | x^(i); θ). So we are looking for the θ that maximizes the likelihood: θ_MLE = arg max_θ l(θ) = arg max_θ Π_{i=1}^n p(y^(i) | x^(i); θ). We can take the log without changing the solution: θ_MLE = arg max_θ Σ_{i=1}^n log p(y^(i) | x^(i); θ).
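Each factor in the product is a Bernoulli likelihood; writing it in the standard compact form (not shown explicitly on the slide, but it is what licenses the expansion on the next slide):

```latex
p(y^{(i)} \mid x^{(i)}; \theta)
  = p(y^{(i)}{=}1 \mid x^{(i)};\theta)^{\,y^{(i)}}
    \left(1 - p(y^{(i)}{=}1 \mid x^{(i)};\theta)\right)^{1 - y^{(i)}},
  \qquad y^{(i)} \in \{0, 1\}
```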

Deriving the Cost Function via Maximum Likelihood Estimation. Expand as follows: θ_MLE = arg max_θ Σ_{i=1}^n log p(y^(i) | x^(i); θ) = arg max_θ Σ_{i=1}^n [ y^(i) log p(y^(i) = 1 | x^(i); θ) + (1 − y^(i)) log(1 − p(y^(i) = 1 | x^(i); θ)) ]. Substitute in the model, and take the negative to yield the logistic regression objective: min_θ J(θ), where J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].
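A minimal NumPy sketch of this objective (function names are illustrative; the small epsilon guarding the logarithms is an added numerical safeguard, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ].
    X is n x (d+1) with a leading column of ones; y contains 0/1 labels."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```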

Intuition Behind the Objective. J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]. Cost of a single instance: cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and −log(1 − h_θ(x)) if y = 0. We can re-write the objective function as J(θ) = Σ_{i=1}^n cost(h_θ(x^(i)), y^(i)). Compare to linear regression: J(θ) = (1 / 2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))^2.

Intuition Behind the Objective. cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and −log(1 − h_θ(x)) if y = 0. Aside: recall the plot of log(z).

Intuition Behind the Objective. cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and −log(1 − h_θ(x)) if y = 0. If y = 1: the cost is 0 if the prediction is correct (h_θ(x) = 1); as h_θ(x) → 0, the cost → ∞. This captures the intuition that larger mistakes should get larger penalties, e.g., predicting h_θ(x) = 0 when y = 1. (Plot: cost vs. h_θ(x) over [0, 1].) Based on example by Andrew Ng

Intuition Behind the Objective. cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and −log(1 − h_θ(x)) if y = 0. If y = 0: the cost is 0 if the prediction is correct (h_θ(x) = 0); as (1 − h_θ(x)) → 0, the cost → ∞. Again this captures the intuition that larger mistakes should get larger penalties. (Plot: cost vs. h_θ(x) over [0, 1] for both cases.) Based on example by Andrew Ng

Regularized Logistic Regression. J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]. We can regularize logistic regression exactly as before: J_regularized(θ) = J(θ) + (λ/2) Σ_{j=1}^d θ_j^2 = J(θ) + (λ/2) ||θ_[1:d]||_2^2.

Gradient Descent for Logistic Regression. J_reg(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2) ||θ_[1:d]||_2^2. Want min_θ J(θ). Initialize θ. Repeat until convergence: θ_j ← θ_j − α ∂J(θ)/∂θ_j (simultaneous update for j = 0 ... d). Use the natural logarithm (ln = log_e) to cancel with the exp() in h_θ(x).

Gradient Descent for Logistic Regression. J_reg(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2) ||θ_[1:d]||_2^2. Want min_θ J(θ). Initialize θ. Repeat until convergence (simultaneous update for j = 0 ... d): θ_0 ← θ_0 − α Σ_{i=1}^n (h_θ(x^(i)) − y^(i)); θ_j ← θ_j − α [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j ].

Gradient Descent for Logistic Regression. Initialize θ. Repeat until convergence (simultaneous update for j = 0 ... d): θ_0 ← θ_0 − α Σ_{i=1}^n (h_θ(x^(i)) − y^(i)); θ_j ← θ_j − α [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j ]. This looks IDENTICAL to linear regression!!! (ignoring the 1/n constant). However, the form of the model is very different: h_θ(x) = 1 / (1 + e^{-θ^T x}).
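Putting the update rule together, a minimal batch gradient descent loop with the L2 penalty, leaving θ_0 unregularized as above (an illustrative sketch; the function name fit_logistic and default hyperparameters are assumptions, not the slides' reference code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.01, lam=0.1, n_iters=1000):
    """Batch gradient descent for regularized logistic regression.
    X is n x (d+1) with a leading column of ones; y contains 0/1 labels."""
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(n_iters):
        error = sigmoid(X @ theta) - y   # h_theta(x^(i)) - y^(i)
        grad = X.T @ error               # same form as linear regression
        grad[1:] += lam * theta[1:]      # do not penalize theta_0
        theta -= alpha * grad            # simultaneous update of all theta_j
    return theta
```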

Multi-Class Classification. Binary classification vs. multi-class classification (illustrated with scatter plots over features x_1 and x_2: two classes vs. several classes). Disease diagnosis: healthy / cold / flu / pneumonia. Object classification: desk / chair / monitor / bookcase.

Multi-Class Logistic Regression. For 2 classes: h_θ(x) = 1 / (1 + exp(−θ^T x)) = exp(θ^T x) / (1 + exp(θ^T x)), where the 1 is the weight assigned to y = 0 and exp(θ^T x) is the weight assigned to y = 1. For C classes {1, ..., C}: p(y = c | x; θ_1, ..., θ_C) = exp(θ_c^T x) / Σ_{c'=1}^C exp(θ_{c'}^T x). Called the softmax function.
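A minimal sketch of the softmax function (the max-subtraction is a standard numerical-stability trick added here, not part of the slide's formula):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = c | x) = exp(theta_c^T x) / sum_c' exp(theta_c'^T x).
    Theta is C x (d+1), one parameter vector per class; x is (d+1,) with a leading 1."""
    scores = Theta @ x        # one score theta_c^T x per class
    scores -= scores.max()    # subtracting a constant does not change the ratios
    e = np.exp(scores)
    return e / e.sum()        # probabilities summing to 1
```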

Multi-Class Logistic Regression. Split into One vs Rest (illustrated with a scatter plot over x_1 and x_2). Train a logistic regression classifier for each class c to predict the probability that y = c with h_c(x) = exp(θ_c^T x) / Σ_{c'=1}^C exp(θ_{c'}^T x).

Implementing Multi-Class Logistic Regression. Use h_c(x) = exp(θ_c^T x) / Σ_{c'=1}^C exp(θ_{c'}^T x) as the model for class c. Gradient descent simultaneously updates all parameters for all C models, with the same derivative as before, just with the above h_c(x). Predict the class label as the most probable label: arg max_c h_c(x).
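Prediction is then a single argmax over the class scores (continuing the illustrative sketch above; since softmax is monotone in the scores, the argmax can be taken directly over θ_c^T x):

```python
import numpy as np

def predict_class(Theta, x):
    """Return the most probable class label arg max_c h_c(x)."""
    scores = Theta @ x             # softmax is monotone in the scores,
    return int(np.argmax(scores))  # so argmax of scores = argmax of probabilities

# Example with C = 3 classes and d = 2 features (plus the bias feature)
Theta = np.array([[0.1, 1.0, -0.5],
                  [0.0, -1.0, 0.5],
                  [0.2, 0.0, 1.0]])
x = np.array([1.0, 2.0, -1.0])
print(predict_class(Theta, x))
```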