Logistic Regression. Machine Learning Fall 2018

Where are we? We have seen the following ideas: linear models, learning as loss minimization, Bayesian learning criteria (MAP and MLE estimation), and the naïve Bayes classifier.

This lecture: Logistic regression; Connection to Naïve Bayes; Training a logistic regression classifier; Back to loss minimization.

Logistic Regression: Setup. The setting: binary classification. Inputs: feature vectors x ∈ ℝ^d. Labels: y ∈ {-1, +1}. Training data: S = {(x_i, y_i)}, m examples.

Classification, but... The output y is discrete valued (-1 or +1). Instead of predicting the output, let us try to predict P(y = 1 | x). This expands the hypothesis space to functions whose output is in [0, 1]. Original problem: ℝ^d → {-1, 1}. Modified problem: ℝ^d → [0, 1]. Effectively this makes the problem a regression problem. Many hypothesis spaces are possible.

The Sigmoid function. The hypothesis space for logistic regression: all functions of the form σ(wᵀx), that is, a linear function composed with a sigmoid function σ (the logistic function). What is the domain and the range of the sigmoid function? This is a reasonable choice; we will see why later.

[Figure: plot of the sigmoid function σ(z) against z.]

The Sigmoid function. What is its derivative with respect to z? dσ/dz = σ(z)(1 - σ(z)).
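A minimal sketch (in Python with NumPy; the helper names are mine, not from the lecture) of the sigmoid and of the derivative identity dσ/dz = σ(z)(1 - σ(z)), checked against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
print(sigmoid_grad(z), numeric)  # the two values agree to ~1e-10
```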

Predicting probabilities. According to the logistic regression model, we have P(y = +1 | x, w) = σ(wᵀx) = 1 / (1 + exp(-wᵀx)), or equivalently P(y | x, w) = σ(y wᵀx) = 1 / (1 + exp(-y wᵀx)). Note that we are directly modeling P(y | x) rather than P(x | y) and P(y).

Predicting a label with logistic regression. Compute P(y = +1 | x; w). If this is greater than half, predict +1, else predict -1. What does this correspond to in terms of wᵀx? Prediction = sgn(wᵀx).
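A small sketch of this prediction rule, assuming a weight vector w is already given (the names and numbers are illustrative, not from the slides): the model reports P(y = +1 | x) = σ(wᵀx) and predicts sgn(wᵀx), which is +1 exactly when that probability exceeds one half.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(y = +1 | x) under the logistic model."""
    return sigmoid(w @ x)

def predict_label(w, x):
    """Predict +1 if P(y = +1 | x) > 0.5, i.e. if w^T x > 0, else -1."""
    return 1 if w @ x > 0 else -1

w = np.array([0.5, -1.0, 2.0])   # example weights (made up)
x = np.array([1.0, 0.2, 0.1])
print(predict_proba(w, x), predict_label(w, x))  # ~0.62 and +1
```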

This lecture: Logistic regression; Connection to Naïve Bayes; Training a logistic regression classifier; Back to loss minimization.

Naïve Bayes and Logistic regression. Remember that the naïve Bayes decision is a linear function: log [P(y = -1 | x, w) / P(y = +1 | x, w)] = -wᵀx. Here, the P's represent the naïve Bayes posterior distribution, and they can be computed from the priors and the likelihoods; that is, P(y = -1 | x, w) is computed using P(x | y = -1, w) and P(y = -1 | w). But we also know that P(y = +1 | x, w) = 1 - P(y = -1 | x, w). Substituting into the expression above, we get P(y = +1 | x, w) = σ(wᵀx) = 1 / (1 + exp(-wᵀx)). That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs: naïve Bayes is a generative model, and logistic regression is the discriminative version.
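To make the connection concrete, here is a sketch of one instantiation (my own example, not from the slides): Gaussian naïve Bayes with per-feature variances shared across the two classes. Its posterior is exactly a sigmoid applied to a linear function of x, with the weights determined by the naïve Bayes parameters.

```python
import numpy as np

# Hedged sketch: for Gaussian naive Bayes with per-feature variances shared by
# both classes, P(y = +1 | x) equals a sigmoid of a linear function of x.
mu_pos, mu_neg = np.array([1.0, 2.0]), np.array([-1.0, 0.5])  # class means (made up)
var = np.array([0.5, 2.0])                                    # shared variances (made up)
prior_pos = 0.3                                               # P(y = +1)

def nb_posterior(x):
    """P(y = +1 | x) computed the generative way: prior times likelihood."""
    log_px_pos = -0.5 * np.sum((x - mu_pos) ** 2 / var)  # shared normalizers cancel
    log_px_neg = -0.5 * np.sum((x - mu_neg) ** 2 / var)
    num = prior_pos * np.exp(log_px_pos)
    return num / (num + (1 - prior_pos) * np.exp(log_px_neg))

# The same posterior, written as sigmoid(w^T x + b) with w, b from the NB parameters
w = (mu_pos - mu_neg) / var
b = -0.5 * np.sum((mu_pos ** 2 - mu_neg ** 2) / var) + np.log(prior_pos / (1 - prior_pos))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.3, -1.2])
print(nb_posterior(x), sigmoid(w @ x + b))  # the two probabilities match
```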

This lecture: Logistic regression; Connection to Naïve Bayes; Training a logistic regression classifier (first: maximum likelihood estimation; then: adding priors → maximum a posteriori estimation); Back to loss minimization.

Maximum likelihood estimation. Let's get back to the problem of learning. Training data: S = {(x_i, y_i)}, m examples. What we want: find a w such that P(S | w) is maximized. We know that our examples are drawn independently and are identically distributed (i.i.d.). How do we proceed?

Maximum likelihood estimation. The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution. argmax_w P(S | w) = argmax_w ∏_{i=1}^{m} P(y_i | x_i, w). The usual trick: convert the product to a sum by taking the log. Recall that this works only because log is an increasing function, so the maximizer will not change. This is equivalent to solving max_w ∑_i log P(y_i | x_i, w). But (by definition) we know that P(y | x, w) = σ(y wᵀx) = 1 / (1 + exp(-y wᵀx)). Substituting, maximum likelihood estimation is equivalent to solving max_w -∑_i log(1 + exp(-y_i wᵀx_i)). Equivalent to: training a linear classifier by minimizing the logistic loss.
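The equivalence above is easy to check numerically. The sketch below (illustrative names and toy data, not from the slides) computes the log-likelihood ∑_i log P(y_i | x_i, w) under the logistic model and the total logistic loss ∑_i log(1 + exp(-y_i wᵀx_i)); each is the negative of the other, so maximizing one minimizes the other.

```python
import numpy as np

def log_likelihood(w, X, y):
    """sum_i log P(y_i | x_i, w) under the logistic model."""
    margins = y * (X @ w)                       # y_i * w^T x_i
    return np.sum(-np.log1p(np.exp(-margins)))  # log sigma(y_i w^T x_i)

def logistic_loss(w, X, y):
    """sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # 5 toy examples, 3 features
y = np.array([1, -1, 1, 1, -1])
w = rng.normal(size=3)
print(log_likelihood(w, X, y), -logistic_loss(w, X, y))  # identical values
```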

Maximum a posteriori estimation. We could also add a prior on the weights. Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation σ: p(w) = ∏_{j=1}^{d} p(w_j) = ∏_{j=1}^{d} (1 / (σ√(2π))) exp(-w_j² / (2σ²)).

MAP estimation for logistic regression. Let us work through this procedure again to see what changes. What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.) Here the goal is to maximize the posterior probability of the model given the data, i.e., to find the most probable model given the data: P(w | S) ∝ P(S | w) P(w). Learning by solving argmax_w P(w | S) = argmax_w P(S | w) P(w). Take the log to simplify: max_w log P(S | w) + log P(w). We have already expanded out the first term: log P(S | w) = -∑_i log(1 + exp(-y_i wᵀx_i)). Expanding the log prior gives -∑_{j=1}^{d} w_j² / (2σ²) + constants. Putting these together, learning becomes max_w -∑_i log(1 + exp(-y_i wᵀx_i)) - (1/(2σ²)) wᵀw. Maximizing a negative function is the same as minimizing the function.

Learning a logistic regression classifier. Learning a logistic regression classifier is equivalent to solving min_w ∑_i log(1 + exp(-y_i wᵀx_i)) + (1/(2σ²)) wᵀw. Where have we seen this before? The first question in the homework: write down the stochastic gradient descent algorithm for this (a sketch is given below). Historically, other training algorithms exist; in particular, you might run into L-BFGS.
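Since the slide points to stochastic gradient descent, here is a minimal sketch of SGD for the regularized objective above. It is one plausible way to write the update, not the course's official solution; the learning rate, epoch count, σ, and the choice to spread the regularizer's gradient over the m examples are all my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, sigma=1.0, lr=0.1, epochs=100, seed=0):
    """SGD on: sum_i log(1 + exp(-y_i w^T x_i)) + (1/(2 sigma^2)) w^T w."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):            # visit examples in random order
            margin = y[i] * (w @ X[i])
            # gradient of the logistic loss on example i ...
            grad = -y[i] * X[i] * sigmoid(-margin)
            # ... plus the regularizer's gradient, spread evenly over the m examples
            grad += w / (m * sigma ** 2)
            w -= lr * grad
    return w

# Toy usage: two roughly separable Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, size=(50, 2)), rng.normal(-1, 1, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = sgd_logistic(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```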

Logistic regression is: a classifier that predicts the probability that the label is +1 for a particular input; the discriminative counterpart of the naïve Bayes classifier; a discriminative classifier that can be trained via MLE or MAP estimation; a discriminative classifier that minimizes the logistic loss over the training set.

This lecture: Logistic regression; Connection to Naïve Bayes; Training a logistic regression classifier; Back to loss minimization.

Learning as loss minimization. The setup: examples x are drawn from a fixed, unknown distribution D; a hidden oracle classifier f labels the examples; we wish to find a hypothesis h that mimics f. The ideal situation: define a loss function L that penalizes bad hypotheses, and learning picks a function h ∈ H to minimize the expected loss. But the distribution D is unknown. Instead, minimize the empirical loss on the training set.

Empirical loss minimization. Learning = minimize empirical loss on the training set. Is there a problem here? Overfitting! We need something that biases the learner towards simpler hypotheses. This is achieved using a regularizer, which penalizes complex hypotheses.

Regularized loss minimization. Learning: minimize the regularized loss. With linear classifiers (using l2 regularization): min_w λ wᵀw + ∑_i L(y_i, wᵀx_i). What is a loss function? Loss functions should penalize mistakes. We are minimizing average loss over the training data. What is the ideal loss function for classification?

The 0-1 loss. Penalize classification mistakes between the true label y and the prediction y'. For linear classifiers, the prediction is y' = sgn(wᵀx), and there is a mistake if y wᵀx ≤ 0. Minimizing the 0-1 loss is intractable; we need surrogates.

The loss function zoo. Many loss functions exist: perceptron loss, hinge loss (SVM), exponential loss (AdaBoost), and logistic loss (logistic regression).

[Figure: the loss function zoo. Plots of the zero-one, hinge (SVM), perceptron, exponential (AdaBoost), and logistic regression losses as functions of the margin, shown at increasingly zoomed-out scales.]
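To accompany the figure, a small sketch (my own; it assumes each loss is written as a function of the margin m = y wᵀx, as in the plot) of the surrogate losses named above, so the curves in the zoo can be reproduced:

```python
import numpy as np

# Each surrogate loss as a function of the margin m = y * w^T x
def zero_one(m):    return (m <= 0).astype(float)       # 1 on a mistake, else 0
def perceptron(m):  return np.maximum(0.0, -m)          # perceptron loss
def hinge(m):       return np.maximum(0.0, 1.0 - m)     # SVM
def exponential(m): return np.exp(-m)                   # AdaBoost
def logistic(m):    return np.log1p(np.exp(-m))         # logistic regression

margins = np.linspace(-2, 2, 9)
for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential),
                   ("logistic", logistic)]:
    print(f"{name:12s}", np.round(loss(margins), 2))
```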