Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Machine Learning, Lecture 04: Logistic and Softmax Regression
Nevin L. Zhang, lzhang@cse.ust.hk
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on internet resources and:
K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 8)
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org. (Chapter 4)
Andrew Ng. Lecture Notes on Machine Learning. Stanford.
Hal Daume. A Course on Machine Learning. http://ciml.info/

Outline
1 Logistic Regression
2 Gradient Descent
3 Newton's Method
4 Softmax Regression
5 Optimization Approach to Classification

Logistic Regression

Recap of Probabilistic Linear Regression

Training set D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ R.
Probabilistic model: p(y | x, θ) = N(y | µ(x), σ²) = N(y | wᵀx, σ²)
Learning: determine w by minimizing the negative loglikelihood NLL(w, σ) = −log P(D | w, σ)
Point estimate of y: ŷ = µ(x) = wᵀx

Logistic Regression (for Classification)

Training set D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ {0, 1}.
Probabilistic model: p(y | x, w) = Ber(y | σ(wᵀx))
σ(z) is the sigmoid/logistic/logit function:
σ(z) = 1 / (1 + exp(−z)) = e^z / (e^z + 1)
It maps the real line R to [0, 1]. It is not to be confused with the variance σ² of the Gaussian distribution.
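As a minimal illustration (not part of the slides; NumPy, with the helper name `sigmoid` being our own), the mapping can be coded directly:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid/logistic function: maps the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx [0.0067, 0.5, 0.9933]
```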

Logistic Regression

Rephrasing the probabilistic model:
p(y = 1 | x, w) = σ(wᵀx) = 1 / (1 + exp(−wᵀx))
Equivalently, log [p(y = 1 | x, w) / (1 − p(y = 1 | x, w))] = wᵀx, where log [p / (1 − p)] is called the logit of p.
Learning: determine w by minimizing the negative loglikelihood NLL(w) = −log P(D | w)

Logistic Regression: Decision Rule

To classify instances, we obtain a point estimate of y: ŷ = arg max_y p(y | x, w)
In other words, the decision/classification rule is: ŷ = 1 iff p(y = 1 | x, w) > 0.5
This is called the optimal Bayes classifier: suppose the same situation occurs many times. You will make some mistakes no matter which decision rule you use, but the probability of a mistake is minimized by the rule above.

Logistic Regression: Example

Solid black dots are the data: those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants accepted. The red circles are the predicted probabilities that the applicants would be accepted.

Logistic Regression: 2D Example

The decision boundary p(y = 1 | x, w) = 0.5 is a straight line in the feature space.

Logistic Regression is a Linear Classifier

In fact, the decision/classification rule in logistic regression is equivalent to: ŷ = 1 iff wᵀx > 0
So it is a linear classifier with a linear decision boundary.
Example: whether a student is admitted based on the results of two exams.

Some Math

Since P(y = 1 | x, w) = σ(wᵀx), we have P(y = 0 | x, w) = 1 − σ(wᵀx).
We can combine the two equations into one:
P(y | x, w) = σ(wᵀx)^y (1 − σ(wᵀx))^(1−y)
Hence, log P(y | x, w) = y log σ(wᵀx) + (1 − y) log(1 − σ(wᵀx))

Model Fitting

We would like to find the MLE of w, i.e., the value of w that minimizes the negative loglikelihood:
NLL(w) = −log P(D | w)
       = −Σ_{i=1}^N log P(y_i | x_i, w)
       = −Σ_{i=1}^N [y_i log σ(wᵀx_i) + (1 − y_i) log(1 − σ(wᵀx_i))]
Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it:
Gradient descent
Newton's method
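A small sketch of this objective (not from the slides; the helper names `sigmoid` and `nll` are ours, and a tiny clipping constant is added to avoid log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y, eps=1e-12):
    """Negative loglikelihood of logistic regression (illustrative sketch).
    X: (N, D) design matrix, y: (N,) labels in {0, 1}, w: (D,) weights."""
    p = np.clip(sigmoid(X @ w), eps, 1 - eps)   # predicted P(y = 1 | x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```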

Gradient Descent

Gradient Descent

Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as follows:
J'(w) = dJ(w)/dw = lim_{ɛ→0} [J(w + ɛ) − J(w)] / ɛ
When ɛ is small, we have J(w + ɛ) ≈ J(w) + ɛJ'(w).
This equation tells us how to reduce J(w) by changing w in small steps:
If J'(w) > 0, make ɛ negative, i.e., decrease w;
If J'(w) < 0, make ɛ positive, i.e., increase w.
In other words, move in the opposite direction of the derivative (gradient).

Gradient Descent

To implement the idea, we update w as follows:
w ← w − αJ'(w)
The term −J'(w) means that we move in the opposite direction of the derivative, and α determines how much we move in that direction. It is called the step size in optimization and the learning rate in machine learning.

Gradient Descent

Now consider a function y = J(w) of a vector w = (w_0, w_1, ..., w_D). The gradient of J at w is defined as
∇_w J = (∂J/∂w_0, ∂J/∂w_1, ..., ∂J/∂w_D)
The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, move in the opposite direction of the gradient.

Gradient Descent

The method of steepest descent/gradient descent for minimizing J(w):
1 Initialize w
2 Repeat until convergence: w ← w − α ∇_w J(w)
The learning rate α usually changes from iteration to iteration. A generic sketch of this loop is given below.
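A minimal sketch of the generic loop (not from the slides; the name `gradient_descent`, the fixed learning rate, and the stopping tolerance are our own choices):

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.1, tol=1e-6, max_iter=10000):
    """Minimize a differentiable function given its gradient grad_fn(w). Sketch."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_fn(w)
        w_new = w - alpha * g                  # step against the gradient
        if np.linalg.norm(w_new - w) < tol:    # crude convergence test
            return w_new
        w = w_new
    return w

# Example: minimize J(w) = (w - 3)^2, whose gradient is 2(w - 3).
print(gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0]))  # approx [3.]
```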

Choice of Learning Rate

A constant learning rate is difficult to set:
Too small, and convergence will be very slow.
Too large, and the method can fail to converge at all.
It is better to vary the learning rate. We will discuss this more in L07.

Gradient Descent

Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well.

Derivative of the Sigmoid Function

To apply gradient descent to logistic regression, we need to compute the partial derivative of the NLL w.r.t. each weight w_j. Before doing that, first consider the derivative of the sigmoid function:
σ'(z) = dσ(z)/dz = d/dz [1 / (1 + e^(−z))]
      = −(1 + e^(−z))^(−2) · d(1 + e^(−z))/dz
      = e^(−z) / (1 + e^(−z))²
      = [1 / (1 + e^(−z))] · [1 − 1 / (1 + e^(−z))]
      = σ(z)(1 − σ(z))
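A quick numerical sanity check of this identity (our own finite-difference comparison, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
h = 1e-6
finite_diff = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # numerical derivative
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma(z) * (1 - sigma(z))
print(np.max(np.abs(finite_diff - analytic)))                # tiny, around 1e-10
```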

Derivative of NLL

Let z_i = wᵀx_i. Then
∂NLL(w)/∂w_j = −Σ_{i=1}^N [y_i ∂log σ(z_i)/∂w_j + (1 − y_i) ∂log(1 − σ(z_i))/∂w_j]
             = −Σ_{i=1}^N [y_i / σ(z_i) − (1 − y_i) / (1 − σ(z_i))] ∂σ(z_i)/∂w_j
             = −Σ_{i=1}^N [y_i / σ(z_i) − (1 − y_i) / (1 − σ(z_i))] σ(z_i)(1 − σ(z_i)) ∂z_i/∂w_j
             = −Σ_{i=1}^N [y_i (1 − σ(z_i)) − (1 − y_i) σ(z_i)] x_{i,j}
             = −Σ_{i=1}^N [y_i − σ(z_i)] x_{i,j}

Batch Gradient Descent

The batch gradient descent algorithm for logistic regression:
Repeat until convergence: w_j ← w_j + α Σ_{i=1}^N [y_i − σ(wᵀx_i)] x_{i,j}
Interpretation (assume the x_i are positive vectors):
If the predicted value σ(wᵀx_i) is smaller than the actual value y_i, there is reason to increase w_j. The increment is proportional to x_{i,j}.
If the predicted value σ(wᵀx_i) is larger than the actual value y_i, there is reason to decrease w_j. The decrement is proportional to x_{i,j}.
A vectorized sketch of this update appears below.
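A compact vectorized sketch of batch gradient descent for logistic regression (not from the slides; the function name, learning rate, and iteration count are our own, and the learning rate may need tuning for a given data set):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_batch(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on the logistic-regression NLL (sketch).
    X: (N, D) with a column of ones for the bias, y: (N,) labels in {0, 1}."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        err = y - sigmoid(X @ w)      # y_i - sigma(w^T x_i), shape (N,)
        w += alpha * (X.T @ err)      # w_j += alpha * sum_i err_i * x_{i,j}
    return w
```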

Stochastic Gradient Descent

Batch gradient descent is costly when N is large. In that case, we usually use stochastic gradient descent:
Repeat until convergence:
  Randomly choose B ⊂ {1, 2, ..., N}
  w_j ← w_j + α Σ_{i∈B} [y_i − σ(wᵀx_i)] x_{i,j}
The randomly picked subset B is called a minibatch. Its size is called the batch size. A sketch is given below.
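A minibatch SGD sketch under the same assumptions as the batch version above (epoch count, batch size, and the per-epoch shuffling scheme are our own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, alpha=0.1, batch_size=32, n_epochs=50, seed=0):
    """Minibatch stochastic gradient descent for logistic regression (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        order = rng.permutation(N)                 # shuffle once per epoch
        for start in range(0, N, batch_size):
            B = order[start:start + batch_size]    # indices of the minibatch
            err = y[B] - sigmoid(X[B] @ w)
            w += alpha * (X[B].T @ err)
    return w
```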

L2 Regularization

Just as we prefer ridge regression to plain linear regression, we prefer to add a regularization term to the objective function of logistic regression:
J(w) = NLL(w) + (1/2) λ ||w||₂²
In this case, we have
∂J(w)/∂w_j = −Σ_{i=1}^N [y_i − σ(z_i)] x_{i,j} + λ w_j
The weight update formula is:
w_j ← w_j + α [−λ w_j + Σ_{i∈B} (y_i − σ(wᵀx_i)) x_{i,j}]

L2 Regularization

Update rule without regularization: w_j ← w_j + α Σ_{i∈B} [y_i − σ(wᵀx_i)] x_{i,j}
Update rule with regularization: w_j ← (1 − αλ) w_j + α Σ_{i∈B} [y_i − σ(wᵀx_i)] x_{i,j}
Regularization shrinks the weights: since 0 < αλ < 1, we have |(1 − αλ) w_j| < |w_j|.
A sketch of one regularized update is shown below.
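A sketch of one regularized minibatch update, reusing the assumptions above (`lam` stands for the regularization strength λ; the weight-decay form of the update is used):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_sgd_step(w, X_batch, y_batch, alpha=0.1, lam=0.01):
    """One minibatch update with L2 regularization (weight decay). Sketch."""
    err = y_batch - sigmoid(X_batch @ w)
    return (1 - alpha * lam) * w + alpha * (X_batch.T @ err)
```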

Newton's Method

Newton's Method

Consider again a function J(w) of a scalar variable w. We want to minimize J(w). At a minimum point, the derivative J'(w) = 0.
Let f(w) = J'(w). We want to find points where f(w) = 0. Newton's method (aka the Newton-Raphson method) is a commonly used method for this problem:
Start from some initial value w.
Repeat the following update until convergence: w ← w − f(w) / f'(w)

Newton's Method

Interpretation of Newton's method:
Approximate f by the linear function tangent to f at the current guess w_i:
y = f'(w_i) w + f(w_i) − f'(w_i) w_i
Solve for where that linear function equals zero to get the next iterate:
f'(w_i) w + f(w_i) − f'(w_i) w_i = 0  ⟹  w_{i+1} = w_i − f(w_i) / f'(w_i)

Newton's Method

To minimize J(w) using Newton's method:
Start from some initial value w.
Repeat the following update until convergence: w ← w − J'(w) / J''(w)
where J'' is the second derivative of J. A one-dimensional sketch is given below.
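A one-dimensional sketch (our own function name and stopping rule), applied to a simple convex example:

```python
def newton_minimize_1d(dJ, d2J, w0, tol=1e-10, max_iter=100):
    """Minimize a scalar function given its first and second derivatives. Sketch."""
    w = w0
    for _ in range(max_iter):
        step = dJ(w) / d2J(w)     # Newton step: J'(w) / J''(w)
        w -= step
        if abs(step) < tol:
            break
    return w

# Example: J(w) = w^4 - 3w, so J'(w) = 4w^3 - 3 and J''(w) = 12w^2.
# The minimizer is (3/4)^(1/3), approximately 0.9086.
print(newton_minimize_1d(lambda w: 4 * w**3 - 3, lambda w: 12 * w**2, w0=1.0))
```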

Newton's Method

Now consider minimizing a function J(w) of a vector w = (w_0, w_1, ..., w_D).
The first derivative of J(w) is the gradient vector ∇_w J(w).
The second derivative of J(w) is the Hessian matrix H = [H_ij] with H_ij = ∂²J(w) / ∂w_i ∂w_j.
To minimize J(w) using Newton's method:
Start from some initial value w.
Repeat the following update until convergence: w ← w − H⁻¹ ∇_w J(w)
A sketch of this update for the logistic-regression NLL is shown below.
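For logistic regression, the Hessian of the NLL works out to XᵀSX with S = diag(σ(z_i)(1 − σ(z_i))), a standard result not derived on the slides. A minimal sketch, with our own function name and a small ridge term added for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_newton(X, y, n_iters=20, ridge=1e-8):
    """Newton's method for the logistic-regression NLL (sketch).
    Gradient: -X^T (y - p); Hessian: X^T S X with S = diag(p * (1 - p))."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        grad = -X.T @ (y - p)
        H = X.T @ (X * (p * (1 - p))[:, None]) + ridge * np.eye(D)
        w -= np.linalg.solve(H, grad)     # w <- w - H^{-1} grad
    return w
```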

Newton's Method: Another Perspective

By Taylor expansion, we have
J(w + ɛ) ≈ J(w) + J'(w)ɛ + (1/2) J''(w)ɛ²
We want dJ(w + ɛ)/dɛ = 0, i.e., J'(w) + J''(w)ɛ = 0.
Solving this equation, we get ɛ = −J'(w) / J''(w).

Newton's Method vs Gradient Descent

In gradient descent, the approximation has only a first-order term (gradient):
J(w + ɛ) ≈ J(w) + ɛJ'(w)
In Newton's method, the approximation also has a second-order term (curvature):
J(w + ɛ) ≈ J(w) + J'(w)ɛ + (1/2) J''(w)ɛ²

Newton's Method vs Gradient Descent

Using curvature allows us to better predict how J(w) changes when we change w:
1 With negative curvature, J actually decreases faster than the gradient predicts.
2 With no curvature, the gradient predicts the decrease correctly.
3 With positive curvature, the function decreases more slowly than expected.
The use of curvature mitigates the problems in cases 1 and 3.

Newton's Method vs Gradient Descent

Newton's method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum.
One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian matrix.

Softmax Regression

Multi-Class Logistic Regression

Training set D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ {1, 2, ..., C} and C ≥ 2.
Multi-class logistic regression (aka softmax regression) uses the following probability model:
There is a weight vector w_c = (w_{c,0}, w_{c,1}, ..., w_{c,D}) for each class c, and W = (w_1, ..., w_C) is the weight matrix.
The probability of a particular class c given x is:
p(y = c | x, W) = exp(w_cᵀx) / Z(x, W)
where Z(x, W) = Σ_{c'=1}^C exp(w_{c'}ᵀx) is the normalization constant that ensures the probabilities of all classes sum to 1. It is called the partition function.
The classification rule is: ŷ = arg max_y p(y | x, W)
A sketch of the probability computation is given below.
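A small sketch of the class-probability computation (not from the slides; subtracting the maximum score before exponentiating is a standard overflow-avoidance trick, an implementation detail only):

```python
import numpy as np

def softmax_probs(W, x):
    """p(y = c | x, W) for all classes; W has one row of weights per class. Sketch."""
    scores = W @ x                         # w_c^T x for each class c
    scores = scores - scores.max()         # stabilize: does not change the result
    exps = np.exp(scores)
    return exps / exps.sum()               # divide by the partition function Z

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # 3 classes, 2 features
x = np.array([0.4, 1.0])
p = softmax_probs(W, x)
print(p, p.sum(), p.argmax())              # probabilities, 1.0, predicted class
```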

Relationship with Logistic Regression

When C = 2, name the two classes {0, 1} instead of {1, 2}:
p(y = 0 | x, W) = exp(w_0ᵀx) / [exp(w_0ᵀx) + exp(w_1ᵀx)],   p(y = 1 | x, W) = exp(w_1ᵀx) / [exp(w_0ᵀx) + exp(w_1ᵀx)]
Dividing both numerator and denominator by exp(w_1ᵀx), we get
p(y = 0 | x, W) = exp((w_0 − w_1)ᵀx) / [exp((w_0 − w_1)ᵀx) + 1],   p(y = 1 | x, W) = 1 / [exp((w_0 − w_1)ᵀx) + 1]
Letting w = w_1 − w_0, we recover the logistic regression model:
p(y = 0 | x, w) = exp(−wᵀx) / [1 + exp(−wᵀx)],   p(y = 1 | x, w) = 1 / [1 + exp(−wᵀx)] = σ(wᵀx)

Negative Loglikelihood

The negative loglikelihood of the softmax regression model is:
NLL(W) = −log P(D | W)
       = −Σ_{i=1}^N log P(y_i | x_i, W)
       = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) log P(y = c | x_i, W)
       = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) log [exp(w_cᵀx_i) / Σ_{c'=1}^C exp(w_{c'}ᵀx_i)]
       = −Σ_{i=1}^N Σ_{c=1}^C 1(y_i = c) [w_cᵀx_i − log Σ_{c'=1}^C exp(w_{c'}ᵀx_i)]
       = −Σ_{i=1}^N [Σ_{c=1}^C 1(y_i = c) w_cᵀx_i − log Σ_{c'=1}^C exp(w_{c'}ᵀx_i)]

Partial Derivative of NLL

We would like to find the MLE of W. To do so, we need to minimize the NLL. There is no closed-form solution, so we resort to gradient descent.
The partial derivative of the NLL w.r.t. each w_{c,j} is:
∂NLL(W)/∂w_{c,j} = −Σ_{i=1}^N [∂/∂w_{c,j} Σ_{c'=1}^C 1(y_i = c') w_{c'}ᵀx_i − ∂/∂w_{c,j} log Σ_{c'=1}^C exp(w_{c'}ᵀx_i)]
                 = −Σ_{i=1}^N [1(y_i = c) x_{i,j} − (exp(w_cᵀx_i) / Σ_{c'=1}^C exp(w_{c'}ᵀx_i)) x_{i,j}]
                 = −Σ_{i=1}^N [1(y_i = c) − p(y = c | x_i, W)] x_{i,j}
Compare this to the gradient of logistic regression:
∂NLL(w)/∂w_j = −Σ_{i=1}^N [y_i − σ(wᵀx_i)] x_{i,j}

Gradient of NLL

The gradient of the NLL w.r.t. each w_c is:
∇_{w_c} NLL(W) = (∂NLL(W)/∂w_{c,0}, ∂NLL(W)/∂w_{c,1}, ..., ∂NLL(W)/∂w_{c,D})
               = −Σ_{i=1}^N [1(y_i = c) − p(y = c | x_i, W)] x_i
Update rule in gradient descent:
w_c ← w_c + α Σ_{i=1}^N [1(y_i = c) − p(y = c | x_i, W)] x_i
A vectorized sketch of this update appears below.
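A vectorized sketch of batch gradient descent for softmax regression (not from the slides; labels are re-indexed from 0 for convenience, the one-hot encoding of 1(y_i = c) and the hyperparameters are our own choices):

```python
import numpy as np

def fit_softmax_batch(X, y, C, alpha=0.1, n_iters=1000):
    """Batch gradient descent for softmax regression (sketch).
    X: (N, D) with a bias column, y: (N,) integer labels in {0, ..., C-1}."""
    N, D = X.shape
    W = np.zeros((C, D))
    Y = np.eye(C)[y]                        # one-hot targets, (N, C): 1(y_i = c)
    for _ in range(n_iters):
        scores = X @ W.T                    # (N, C) matrix of w_c^T x_i
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)   # p(y = c | x_i, W)
        W += alpha * (Y - P).T @ X          # w_c += alpha * sum_i [1(y_i=c) - p_ic] x_i
    return W
```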

Optimization Approach to Classification

The Probabilistic Approach to Classification

So far, we have been talking about the probabilistic approach to classification:
Start with a training set D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ {0, 1}.
Learn a probabilistic model p(y | x, w) = Ber(y | σ(wᵀx)) by minimizing the negative loglikelihood NLL(w) = −log P(D | w).
Classify new instances using: ŷ = 1 iff p(y = 1 | x, w) > 0.5

The Optimization Approach

Next, we will briefly talk about the optimization approach to classification:
Start with a training set D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ {−1, 1}.
Learn a linear classifier ŷ = sign(wᵀx + b), where x = (x_1, ..., x_D) and w = (w_1, ..., w_D). We drop the convention x_0 = 1 and use b in place of w_0.

The Sign Function

sign x = 1 if x > 0; sign x = 0 if x = 0; sign x = −1 if x < 0.

The Optimization Approach

We determine w and b by minimizing the training/empirical error
L(w, b) = Σ_{i=1}^N L_{0/1}(y_i, ŷ_i)
where L_{0/1}(y_i, ŷ_i) = 1(y_i ≠ ŷ_i) is called the zero/one loss function.
Let z = wᵀx + b. It is easy to see that L_{0/1}(y, ŷ) = 1(y(wᵀx + b) ≤ 0) = 1(yz ≤ 0).
Overloading the notation, we also write it as L_{0/1}(y, z).
So the loss function of linear classifiers is often written as
L(w, b) = Σ_{i=1}^N 1(y_i(wᵀx_i + b) ≤ 0)

A Problem with the Zero/One Loss Function

The zero/one loss function, as a function of w and b, is not convex and hence difficult to optimize. (In the accompanying figure, the x-axis is yz.)

Convex Function

A real-valued function f is convex if the line segment between any two points on its graph lies above or on the graph: for any two points x_1 < x_2 and any t ∈ [0, 1],
f(tx_1 + (1 − t)x_2) ≤ tf(x_1) + (1 − t)f(x_2)
f is strictly convex if the inequality is strict whenever x_1 ≠ x_2 and t ∈ (0, 1).

Surrogate Loss Functions

Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere on the graph of a convex function, it will eventually roll to the minimum. This is not true for non-convex functions.
Since the zero/one loss is hard to optimize, we want to approximate it with a convex function and optimize that function instead. This approximating function is called a surrogate loss.
The surrogate loss needs to be an upper bound on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss.

Logistic Loss

One common surrogate loss function is the logistic loss:
L_log(y, z) = (1/log 2) log(1 + exp(−yz))
            = (1/log 2) log(1 + exp(−(wᵀx + b)))  if y = 1
            = (1/log 2) log(1 + exp(wᵀx + b))     if y = −1
On the other hand, each term in the NLL of logistic regression has the form
−log P(y | x, w) = −[y log σ(wᵀx) + (1 − y) log(1 − σ(wᵀx))]
                 = −log [1 / (1 + exp(−wᵀx))]        if y = 1,   −log [1 − 1 / (1 + exp(−wᵀx))]  if y = 0
                 = log(1 + exp(−wᵀx))                if y = 1,   log(1 + exp(wᵀx))               if y = 0
So logistic regression is the same as a linear classifier with the logistic surrogate loss function (up to the constant factor 1/log 2).

Other Surrogate Loss Functions

Zero/one loss: L_{0/1}(y, z) = 1(yz ≤ 0)
Logistic loss: L_log(y, z) = (1/log 2) log(1 + exp(−yz))
Hinge loss: L_hin(y, z) = max{0, 1 − yz}
Exponential loss: L_exp(y, z) = exp(−yz)
Squared loss: L_sqr(y, z) = (y − z)²
These losses are sketched in code below.
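A small sketch evaluating these losses on the margin yz (not from the slides; the function names are our own, vectorized over an array of scores):

```python
import numpy as np

def zero_one_loss(y, z):
    return (y * z <= 0).astype(float)

def logistic_loss(y, z):
    return np.log1p(np.exp(-y * z)) / np.log(2)

def hinge_loss(y, z):
    return np.maximum(0.0, 1.0 - y * z)

def exponential_loss(y, z):
    return np.exp(-y * z)

def squared_loss(y, z):
    return (y - z) ** 2

y, z = 1.0, np.array([-2.0, 0.0, 0.5, 2.0])   # one label, several candidate scores
for f in (zero_one_loss, logistic_loss, hinge_loss, exponential_loss, squared_loss):
    print(f.__name__, f(y, z))
```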

Loss and Error

When a surrogate loss function is used:
There are two ways to measure how a classifier performs on the training set: training loss and training error.
There are likewise two ways to measure how it performs on the test set: test loss and test error.
As in the case of regression, the test loss/error can be improved using regularization.