CS229 Supplemental Lecture notes


John Duchi

1 Binary classification

In binary classification problems, the target y can take on only two values. In this set of notes, we show how to model this problem by letting y ∈ {−1, +1}, where we say that y = 1 if the example is a member of the positive class and y = −1 if the example is a member of the negative class. We assume, as usual, that we have input features x ∈ R^n.

As in our standard approach to supervised learning problems, we first pick a representation for our hypothesis class (what we are trying to learn), and after that we pick a loss function that we will minimize. In binary classification problems, it is often convenient to use a hypothesis class of the form h_θ(x) = θ^T x, and, when presented with a new example x, we classify it as positive or negative depending on the sign of θ^T x; that is, our predicted label is

  sign(h_θ(x)) = sign(θ^T x),   where   sign(t) = 1 if t > 0,  0 if t = 0,  and −1 if t < 0.

In a binary classification problem, then, the hypothesis h_θ with parameter vector θ classifies a particular example (x, y) correctly if

  sign(θ^T x) = y,   or equivalently,   y θ^T x > 0.   (1)

The quantity y θ^T x in expression (1) is a very important quantity in binary classification, important enough that we call the value y x^T θ the margin for the example (x, y). Often, though not always, one interprets the value h_θ(x) = x^T θ as a measure of the confidence that the parameter vector θ assigns to labels for the point x: if x^T θ is very negative (or very positive), then we more strongly believe the label y is negative (or positive).
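To make this concrete, here is a minimal sketch in Python/NumPy of the raw score θ^T x, the sign prediction, and the margin y θ^T x; the parameter vector and example below are made up for illustration.

```python
import numpy as np

# Hypothetical parameter vector and a single labeled example (x, y), with y in {-1, +1}.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.3])
y = +1

score = theta @ x            # h_theta(x) = theta^T x, the raw confidence in the label
prediction = np.sign(score)  # predicted label: +1, -1, or 0 exactly on the decision boundary
margin = y * score           # y * theta^T x; positive exactly when the example is classified correctly

print(score, prediction, margin)
```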

Now that we have chosen a representation for our data, we must choose a loss function. Intuitively, we would like to choose some loss function so that, for our training data {(x^(i), y^(i))}_{i=1}^m, the θ chosen makes the margin y^(i) θ^T x^(i) very large for each training example. Let us fix a hypothetical example (x, y), let z = y x^T θ denote the margin, and let ϕ : R → R be the loss function; that is, the loss for the example (x, y) with margin z = y x^T θ is ϕ(z) = ϕ(y x^T θ). For any particular loss function, the empirical risk that we minimize is then

  J(θ) = (1/m) ∑_{i=1}^m ϕ(y^(i) θ^T x^(i)).   (2)

Consider our desired behavior: we wish to have y^(i) θ^T x^(i) positive for each training example i = 1, ..., m, and we should penalize those θ for which y^(i) θ^T x^(i) < 0 frequently in the training data. Thus, an intuitive choice for our loss would be one with ϕ(z) small if z > 0 (the margin is positive), while ϕ(z) is large if z < 0 (the margin is negative). Perhaps the most natural such loss is the zero-one loss, given by

  ϕ_zo(z) = 1 if z ≤ 0,  and 0 if z > 0.

In this case, the risk J(θ) is simply the average number of mistakes (misclassifications) the parameter θ makes on the training data. Unfortunately, the loss ϕ_zo is discontinuous, non-convex (why this matters is a bit beyond the scope of the course), and, perhaps even more vexingly, NP-hard to minimize.
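Although the zero-one risk is hard to minimize, it is easy to evaluate. Here is a short sketch, under the assumption that the training examples are stacked in a NumPy matrix X (one row per example) with labels y in {−1, +1}; the arrays and values are made up.

```python
import numpy as np

def zero_one_risk(theta, X, y):
    """Average number of misclassifications: J(theta) with the zero-one loss."""
    margins = y * (X @ theta)      # y^(i) * theta^T x^(i) for every example
    return np.mean(margins <= 0)   # phi_zo(z) = 1 if z <= 0, else 0

# Hypothetical data: 4 examples with 2 features each.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(zero_one_risk(np.array([1.0, 1.0]), X, y))
```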

So we prefer to choose losses that have the shape given in Figure 1. That is, we will essentially always use losses that satisfy ϕ(z) → 0 as z → ∞, while ϕ(z) → ∞ as z → −∞.

Figure 1: The rough shape of loss we desire: the loss is convex and continuous, and tends to zero as the margin z = y x^T θ → ∞.

As a few different examples, here are three loss functions that we will see either now or later in the class, all of which are commonly used in machine learning.

(i) The logistic loss uses ϕ_logistic(z) = log(1 + e^(−z)).

(ii) The hinge loss uses ϕ_hinge(z) = [1 − z]_+ = max{1 − z, 0}.

(iii) The exponential loss uses ϕ_exp(z) = e^(−z).

Figure 2: The three margin-based loss functions: logistic loss, hinge loss, and exponential loss, plotted against the margin z = y x^T θ.

In Figure 2, we plot each of these losses against the margin z = y x^T θ, noting that each goes to zero as the margin grows, and each tends to +∞ as the margin becomes negative. The different loss functions lead to different machine learning procedures: the logistic loss ϕ_logistic gives rise to logistic regression, the hinge loss ϕ_hinge gives rise to so-called support vector machines, and the exponential loss ϕ_exp gives rise to the classical version of boosting, which we will explore in more depth later in the class.
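Each of these losses is a one-line function of the margin. The sketch below (plain Python/NumPy, not part of the notes) evaluates all three at a few margins, illustrating that each is small for large positive margins and grows as the margin becomes negative.

```python
import numpy as np

def logistic_loss(z):
    # log(1 + e^(-z)), written as logaddexp(0, -z) for numerical stability at large |z|
    return np.logaddexp(0.0, -z)

def hinge_loss(z):
    return np.maximum(1.0 - z, 0.0)   # [1 - z]_+

def exp_loss(z):
    return np.exp(-z)

for z in [-2.0, 0.0, 2.0]:
    print(z, logistic_loss(z), hinge_loss(z), exp_loss(z))
```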

2 Logistic regression

With this general background in place, we now give a complementary view of logistic regression to that in Andrew Ng's lecture notes. When we use binary labels y ∈ {−1, 1}, it is possible to write logistic regression more compactly. In particular, we use the logistic loss

  ϕ_logistic(y x^T θ) = log(1 + exp(−y x^T θ)),

and the logistic regression algorithm corresponds to choosing θ that minimizes

  J(θ) = (1/m) ∑_{i=1}^m ϕ_logistic(y^(i) θ^T x^(i)) = (1/m) ∑_{i=1}^m log(1 + exp(−y^(i) θ^T x^(i))).   (3)

Roughly, we hope that choosing θ to minimize the average logistic loss will yield a θ for which y^(i) θ^T x^(i) > 0 for most (or even all!) of the training examples.
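A direct way to compute the risk in Eq. (3) is the following sketch, again with a hypothetical matrix X (one example per row) and label vector y in {−1, +1}. Using np.logaddexp rather than np.log(1 + np.exp(-margins)) avoids overflow when a margin is very negative.

```python
import numpy as np

def logistic_risk(theta, X, y):
    """J(theta) from Eq. (3): the average logistic loss over the training set."""
    margins = y * (X @ theta)                    # y^(i) * theta^T x^(i)
    return np.mean(np.logaddexp(0.0, -margins))  # mean of log(1 + exp(-margin)), overflow-safe

# Hypothetical data: 4 examples, 2 features, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(logistic_risk(np.array([1.0, 1.0]), X, y))
```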

2.1 Probabilistic interpretation

Similar to the linear regression (least-squares) case, it is possible to give a probabilistic interpretation of logistic regression. To do this, we define the sigmoid function (also often called the logistic function)

  g(z) = 1 / (1 + e^(−z)),

which is plotted in Figure 3.

Figure 3: The sigmoid function g(z).

In particular, the sigmoid function satisfies

  g(z) + g(−z) = 1/(1 + e^(−z)) + 1/(1 + e^z) = e^z/(1 + e^z) + 1/(1 + e^z) = 1,

so we can use it to define a probability model for binary classification. In particular, for y ∈ {−1, 1}, we define the logistic model for classification as

  p(Y = y | x; θ) = g(y x^T θ) = 1 / (1 + e^(−y x^T θ)).   (4)

For interpretation, we see that if the margin y x^T θ is large (bigger than, say, 5 or so), then p(Y = y | x; θ) = g(y x^T θ) ≈ 1; that is, we assign nearly probability 1 to the event that the label is y. Conversely, if y x^T θ is quite negative, then p(Y = y | x; θ) ≈ 0.
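Equation (4) is straightforward to evaluate; here is a small sketch (with made-up numbers) that also checks the identity g(z) + g(−z) = 1 numerically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def label_probability(theta, x, y):
    """p(Y = y | x; theta) = g(y * theta^T x) for y in {-1, +1}, as in Eq. (4)."""
    return sigmoid(y * (theta @ x))

theta = np.array([2.0, -1.0])
x = np.array([1.5, 0.5])
# The two label probabilities sum to 1, since g(z) + g(-z) = 1.
print(label_probability(theta, x, +1) + label_probability(theta, x, -1))
```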

By redefining our hypothesis class as

  h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)),

we see that the likelihood of the training data is

  L(θ) = ∏_{i=1}^m p(Y = y^(i) | x^(i); θ) = ∏_{i=1}^m h_θ(y^(i) x^(i)),

and the log-likelihood is precisely

  ℓ(θ) = ∑_{i=1}^m log h_θ(y^(i) x^(i)) = −∑_{i=1}^m log(1 + e^(−y^(i) θ^T x^(i))) = −m J(θ),

where J(θ) is exactly the logistic regression risk from Eq. (3). That is, maximum likelihood in the logistic model (4) is the same as minimizing the average logistic loss, and we arrive at logistic regression again.

2.2 Gradient descent methods

The final part of logistic regression is to actually fit the model. As is usually the case, we consider gradient-descent-based procedures for performing this minimization. With that in mind, we now show how to take derivatives of the logistic loss. For ϕ_logistic(z) = log(1 + e^(−z)), we have the one-dimensional derivative

  (d/dz) ϕ_logistic(z) = ϕ′_logistic(z) = (1/(1 + e^(−z))) · (d/dz) e^(−z) = −e^(−z)/(1 + e^(−z)) = −1/(e^z + 1) = −g(−z),

where g is the sigmoid function. Then we apply the chain rule to find that, for a single training example (x, y), we have

  (∂/∂θ_k) ϕ_logistic(y x^T θ) = −g(−y x^T θ) (∂/∂θ_k)(y x^T θ) = −g(−y x^T θ) y x_k.

Thus, a stochastic gradient procedure for minimization of J(θ) iteratively performs the following for iterations t = 1, 2, ..., where α_t is a stepsize at time t:

1. Choose an example i ∈ {1, ..., m} uniformly at random.

2. Perform the gradient update

   θ^(t+1) = θ^(t) − α_t ∇_θ ϕ_logistic(y^(i) x^(i)T θ^(t)) = θ^(t) + α_t g(−y^(i) x^(i)T θ^(t)) y^(i) x^(i) = θ^(t) + α_t h_{θ^(t)}(−y^(i) x^(i)) y^(i) x^(i).

This update is intuitive: if our current hypothesis h_{θ^(t)} assigns probability close to 1 for the incorrect label −y^(i), then we try to reduce the loss by moving θ in the direction of y^(i) x^(i). Conversely, if our current hypothesis h_{θ^(t)} assigns probability close to 0 for the incorrect label −y^(i), the update essentially does nothing.
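Putting the pieces together, here is a minimal sketch of the stochastic gradient procedure from Section 2.2 (Python/NumPy; the constant stepsize, iteration count, zero initialization, and toy data are illustrative choices, not prescribed by the notes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, alpha=0.1, num_iters=1000, seed=0):
    """Stochastic gradient descent on the logistic regression risk J(theta).

    X: (m, n) array of inputs; y: length-m array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for t in range(num_iters):
        i = rng.integers(m)                  # 1. choose an example uniformly at random
        margin = y[i] * (X[i] @ theta)       # y^(i) * x^(i)^T theta^(t)
        theta = theta + alpha * sigmoid(-margin) * y[i] * X[i]   # 2. gradient update
    return theta

# Hypothetical linearly separable data: the learned theta should give positive margins.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([+1, +1, -1, -1])
theta = sgd_logistic_regression(X, y)
print(theta, y * (X @ theta))
```

In the notation above, sigmoid(-margin) is g(−y^(i) x^(i)T θ^(t)), the probability the current model assigns to the incorrect label, so examples that are already classified confidently and correctly barely move θ.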