Logistic Regression Trained with Different Loss Functions. Discussion

CS640

1 Notations

We restrict our discussion to the binary case.

$$g(z) = \frac{1}{1 + e^{-z}}$$

$$g'(z) = \frac{\partial g(z)}{\partial z} = g(z)\,\big(1 - g(z)\big)$$

$$h_w(x) = g(wx) = \frac{1}{1 + e^{-wx}} = \frac{1}{1 + e^{-\sum_d w_d x_d}}$$

$$P(y = 1 \mid x; w) = h_w(x), \qquad P(y = 0 \mid x; w) = 1 - h_w(x)$$

2 Maximum Likelihood Estimation

2.1 Goal

Maximize the likelihood:

$$L(w) = p(y \mid X; w) = \prod_{i=1}^{m} p(y_i \mid x_i; w) = \prod_{i=1}^{m} \big(h_w(x_i)\big)^{y_i} \big(1 - h_w(x_i)\big)^{1 - y_i}$$

Or, equivalently, maximize the log likelihood:

$$\ell(w) = \log L(w) = \sum_{i=1}^{m} \Big[ y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big]$$
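A minimal NumPy sketch of these definitions (the function and variable names, and the toy data, are my own, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """h_w(x) = g(w . x), the predicted probability that y = 1."""
    return sigmoid(np.dot(w, x))

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i log h_w(x_i) + (1 - y_i) log(1 - h_w(x_i)) ]."""
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny illustration on made-up data.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # first column acts as a bias term
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
print(log_likelihood(w, X, y))  # 3 * log(0.5) ≈ -2.079 at w = 0
```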

2.2 Stochastic Gradient Descent Update Rule

For a single training example $(x_i, y_i)$,

$$\frac{\partial}{\partial w_j} \ell(w) = \left( y_i \frac{1}{g(wx_i)} - (1 - y_i) \frac{1}{1 - g(wx_i)} \right) \frac{\partial}{\partial w_j} g(wx_i)$$

$$= \left( y_i \frac{1}{g(wx_i)} - (1 - y_i) \frac{1}{1 - g(wx_i)} \right) g(wx_i)\big(1 - g(wx_i)\big) \frac{\partial}{\partial w_j} wx_i$$

$$= \Big( y_i \big(1 - g(wx_i)\big) - (1 - y_i)\, g(wx_i) \Big) x_i^j = \big( y_i - h_w(x_i) \big) x_i^j$$

The resulting update rule is

$$w_j := w_j + \lambda \big( y_i - h_w(x_i) \big) x_i^j$$

3 Least Squared Error Estimation

3.1 Goal

Minimize the sum of squared errors:

$$L(w) = \frac{1}{2} \sum_{i=1}^{m} \big( y_i - h_w(x_i) \big)^2$$

3.2 Stochastic Gradient Descent Update Rule

For a single training example $(x_i, y_i)$,

$$\frac{\partial}{\partial w_j} L(w) = -\big( y_i - h_w(x_i) \big) \frac{\partial h_w(x_i)}{\partial w_j} = -\big( y_i - h_w(x_i) \big)\, h_w(x_i) \big( 1 - h_w(x_i) \big) x_i^j$$

The resulting update rule is

$$w_j := w_j + \lambda \big( y_i - h_w(x_i) \big)\, h_w(x_i) \big( 1 - h_w(x_i) \big) x_i^j$$
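Here is a compact sketch of one stochastic gradient pass under each objective, following the two update rules above (the learning rate, data, and function names are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(w, X, y, lam=0.1, rule="mle"):
    """One pass over the data, updating w after each example.

    rule="mle":  w_j += lam * (y_i - h) * x_ij                 (Section 2.2)
    rule="lse":  w_j += lam * (y_i - h) * h * (1 - h) * x_ij   (Section 3.2)
    """
    for x_i, y_i in zip(X, y):
        h = sigmoid(np.dot(w, x_i))
        if rule == "mle":
            w = w + lam * (y_i - h) * x_i
        else:
            w = w + lam * (y_i - h) * h * (1 - h) * x_i
    return w

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(sgd_epoch(np.zeros(2), X, y, rule="mle"))
print(sgd_epoch(np.zeros(2), X, y, rule="lse"))
```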

4 Comparison

4.1 Update Rule

For maximum likelihood logistic regression:

$$w_j := w_j + \lambda \big( y_i - h_w(x_i) \big) x_i^j$$

For least squared error logistic regression:

$$w_j := w_j + \lambda \big( y_i - h_w(x_i) \big)\, h_w(x_i) \big( 1 - h_w(x_i) \big) x_i^j$$

Let

$$f_1(h) = (y - h), \qquad f_2(h) = (y - h)\, h\, (1 - h), \qquad y \in \{0, 1\},\; h \in (0, 1).$$

When $y = 1$, the plots of $f_1(h)$ and $f_2(h)$ are shown in Figure 1. $f_1(h)$ is a monotonically decreasing function: the closer the predicted probability is to 1, the smaller the update; the farther the predicted probability is from 1, the bigger the update. $f_2(h)$ is not monotone. When the predicted probability is close to 0 (far away from the true label), there is almost no update, and the biggest update is achieved at $h = 1/3$. This behavior seems unintuitive. One tentative explanation is that $f_2$ is robust to outliers. In the range $(1/3, 1)$, the farther the predicted probability is from 1, the bigger the update: in this range, $f_2$ is willing to better incorporate the data point into the model. In the range $(0, 1/3)$, the algorithm considers the prediction so far from the true value that the data point is very likely mislabeled, and gives up on it: the farther the predicted probability is from 1, the more likely the point is mislabeled, the less effort we should spend on it, and thus the smaller the update. Therefore $h = 1/3$ can be regarded as a "giving up" threshold.

[Figure 1: $f_1(h)$ and $f_2(h)$ for $y = 1$]

When $y = 0$, the plots of $f_1(h)$ and $f_2(h)$ are shown in Figure 2. $f_1(h)$ is again monotone. $f_2(h)$ is not monotone: when the predicted probability is close to 1 (far away from the true label), there is almost no update, and the biggest update (in absolute value) is achieved at $h = 2/3$. This again seems unintuitive, and a similar robustness argument can be made.

[Figure 2: $f_1(h)$ and $f_2(h)$ for $y = 0$]
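A quick numerical check of where $f_2$ peaks (a sketch; the grid resolution is arbitrary):

```python
import numpy as np

h = np.linspace(1e-4, 1 - 1e-4, 100001)

f2_y1 = (1 - h) * h * (1 - h)   # f_2(h) for y = 1
f2_y0 = (0 - h) * h * (1 - h)   # f_2(h) for y = 0

print(h[np.argmax(f2_y1)])          # ≈ 1/3: largest update when y = 1
print(h[np.argmax(np.abs(f2_y0))])  # ≈ 2/3: largest-magnitude update when y = 0
```

Setting the derivative to zero confirms this analytically: for $y = 1$, $\frac{d}{dh}\,h(1-h)^2 = (1-h)(1-3h) = 0$ at $h = 1/3$; for $y = 0$, the same calculation on $h^2(1-h)$ gives $h = 2/3$.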

4.2 Consistency

Based on the analysis in [1], both loss functions ensure consistency: the classification error converges to the optimal Bayes error as the sample size goes to infinity (if we do not restrict the function class).

4.3 Probability Estimation

4.4 Robustness

If we view logistic regression as outputting a probability distribution vector, then both methods try to minimize the distance between the true probability distribution and the predicted probability distribution; the only difference is how the distance is defined. For the maximum likelihood method, the distance measure is the KL-divergence; for the least squared error method, it is the Euclidean distance. If, for a data point $x$, the true probability distribution is $[0, 1]$, i.e., the true label is 1, but the predicted probability distribution is $[0.99, 0.01]$, then the KL-divergence between the two distributions is $\log_2 100 \approx 6.64$, while the squared Euclidean distance is $0.99^2 + 0.99^2 \approx 1.96$. Since the KL-divergence is unbounded but the squared Euclidean distance is bounded above by 2, logistic regression trained by minimizing squared error appears more robust to outliers.
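A small sketch of the distance computation in the example above (the helper name is mine; the KL term uses the convention $0 \log 0 = 0$):

```python
import numpy as np

def kl_divergence_bits(p, q):
    """KL(p || q) in bits, with the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

true_dist = [0.0, 1.0]    # [P(y=0), P(y=1)]: the true label is 1
pred_dist = [0.99, 0.01]  # a badly wrong prediction

print(kl_divergence_bits(true_dist, pred_dist))                 # log2(100) ≈ 6.64
print(sum((p - q) ** 2 for p, q in zip(true_dist, pred_dist)))  # 0.99^2 + 0.99^2 ≈ 1.96
```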

4.5 Convexity and Convergence

The loss function for maximum likelihood estimation is

$$C_1(w) = -\ell(w) = -\sum_{i=1}^{m} \Big[ y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big]$$

$C_1(w)$ is a convex function. A proof of the convexity can be found at http://mathgotchas.blogspot.com/2011/10/why-is-error-function-minimized-in.html. For one data point $x = (1, 1)$, $y = 1$, the plot of $C_1(w)$ is shown in Figure 3.

[Figure 3: $C_1(w)$]

The loss function for the least squared error estimation is

$$C_2(w) = L(w) = \frac{1}{2} \sum_{i=1}^{m} \big( y_i - h_w(x_i) \big)^2$$

$C_2(w)$ is a non-convex function. For one data point $x = (1, 1)$, $y = 1$, the plot of $C_2(w)$ is shown in Figure 4.

[Figure 4: $C_2(w)$]
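The non-convexity of $C_2$ can be checked numerically with a midpoint test on the single data point $x = (1, 1)$, $y = 1$ used for the figures (a sketch; the two test points are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 1.0]), 1.0

def C1(w):  # negative log likelihood for the single point (x, y = 1)
    return -np.log(sigmoid(np.dot(w, x)))

def C2(w):  # squared error for the single point (x, y = 1)
    return 0.5 * (y - sigmoid(np.dot(w, x))) ** 2

wa, wb = np.array([-3.0, -1.0]), np.array([0.0, 0.0])
mid = 0.5 * (wa + wb)

# Convexity requires f(mid) <= (f(wa) + f(wb)) / 2 for every pair of points.
print(C1(mid) <= 0.5 * (C1(wa) + C1(wb)))  # True: consistent with convexity
print(C2(mid) <= 0.5 * (C2(wa) + C2(wb)))  # False: convexity is violated
```

A single violated midpoint is enough to show that $C_2$ is not convex; passing the test at one pair of points does not, of course, prove that $C_1$ is convex, so the cited proof is still needed.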

So we can see that training logistic regression by maximizing the likelihood is a convex optimization problem, which is easier, while training it by minimizing the squared error is a non-convex optimization problem, which is harder. If we view logistic regression trained with least squared error as a single-node neural network, it is easy to imagine that the same non-convexity problem also arises in multi-layer neural networks. Regarding convexity versus non-convexity in machine learning, there is an interesting discussion at https://cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.pdf

4.6 Newton Method

5 Empirical Study

6 Conclusion

References

[1] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.