Introduction to Logistic Regression

Guy Lebanon

1 Binary Classification

Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the foundation for many high-tech ML applications such as ad placement, feed ranking, spam filtering, and recommendation systems. Below, we assume that $x$ is a $d$-dimensional vector $x = (x_1, \ldots, x_d)$ of real numbers, and $y$ is a scalar denoting the label: $+1$ or $-1$. We will denote by $\theta$ the vector of parameters defining the classifier, whose dimensionality is the same as that of $x$: $\theta = (\theta_1, \ldots, \theta_d)$. The inner product between $\theta$ and $x$ is defined as $\langle \theta, x \rangle = \theta_1 x_1 + \cdots + \theta_d x_d$.

The binary classification task is defined as follows: given a vector of features $x$, assign a label $y$ of $+1$ or $-1$. A binary classifier is a rule that makes such a mapping for arbitrary vectors $x$. The classifier is typically learned from training data composed of $n$ labeled vectors $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$. Note that each $x^{(i)}$ is a vector and $y^{(i)}$ is its $+1$ or $-1$ label. For example, in the case of the LinkedIn feed, $x$ can contain measurements characterizing the user and the item, and $y$ represents whether or not the user clicks on the item. An example of a feature could be $x_5 = 1$ if the user has more than 500 connections and the item is a connection update. There could be thousands or more such features. The training data is a sequence of user-item pairs with click/no-click information, and at serving time the classifier tries to predict whether the user will click on the feed item or not.

2 Linear Classifiers and Hyperplanes

A linear classifier is a classifier that has the following algebraic form: $y = \mathrm{sign}(\langle \theta, x \rangle)$. That is, the predicted label of the vector $x$ is the sign of the inner product of that vector with another vector $\theta$, called the parameter vector of the classifier. If $\langle \theta, x \rangle$ is positive the predicted class is $+1$, and if it is negative the predicted class is $-1$. Despite their simplicity (and probably because of it), linear classifiers are the most widely used. There are several reasons: (a) they are easy to train, (b) they can predict labels very fast at serve time (if $x$ is sparse, i.e., mostly zeros, the inner product $\langle \theta, x \rangle$ is extremely fast to compute), and (c) the statistical theory of linear classifiers is well understood, leading to effective modeling strategies. Arguably, the most widely used linear classifier is logistic regression. Logistic regression, or its variations, powers core functionality at big companies such as Google, LinkedIn, and Amazon, as well as at small startups.
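To make the sign-prediction rule concrete, here is a minimal sketch in Python (using NumPy); the function name and the toy numbers are illustrative additions, not part of the original notes.

```python
import numpy as np

def predict_label(theta, x):
    """Linear classifier: predict +1 or -1 as the sign of the inner product <theta, x>."""
    score = np.dot(theta, x)          # <theta, x> = theta_1 x_1 + ... + theta_d x_d
    return 1 if score > 0 else -1

# Toy example with d = 3 features (illustrative values only).
theta = np.array([0.5, -1.0, 2.0])    # parameter vector
x = np.array([1.0, 0.0, 0.3])         # feature vector
print(predict_label(theta, x))        # -> 1, since <theta, x> = 1.1 > 0
```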

A note on dimensionality: for visualization purposes we will assume that there are only two features, i.e., $d = 2$. In reality, $d$ may be much higher, e.g., $10^4$ to $10^6$. Linear classifiers excel in such high-dimensional cases due to their simplicity, their attractive computational load, and their nice statistical properties.

There is another reason why linear classifiers are called linear, besides the fact that the classification rule is defined by the linear form $\langle \theta, x \rangle$. If we draw a Cartesian coordinate system and mark the data vectors $x$ as points in that coordinate system, the classification rule of a linear classifier boils down to a linear (hyperplane) decision boundary that is perpendicular to $\theta$. In other words, $\theta$ (interpreted as a vector in that coordinate system) defines a hyperplane that is perpendicular to it, and that hyperplane divides the space of data vectors in half: all vectors on one side are predicted to have positive labels and all vectors on the other side are predicted to have negative labels. This can be seen by recalling that $\langle \theta, x \rangle$ equals the projection of the vector $x$ onto the vector $\theta$ times the norm of $\theta$.

To make sure that the decision boundary does not have to pass through the origin, we need in effect a classification rule of the form $x_1 \theta_1 + \cdots + x_d \theta_d + c = \langle \theta, x \rangle + c$. The added term $c$ is called a bias term, and it enables the decision boundary to not pass through the origin. To simplify notation, however, we assume that all vectors $x$ include $1$ as one of their dimensions, so the inner product $\langle \theta, x \rangle$ already includes the offset term implicitly.

Note that we can increase the dimensionality of the data vectors $x$ by adding powers of $x$ or other transformations. For example, given a two-dimensional data vector $x = (x_1, x_2)$ we can create a corresponding six-dimensional vector $\tilde{x}$ defined by $\tilde{x} = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. We can then train a linear classifier on the transformed data vectors $(\tilde{x}^{(1)}, y^{(1)}), \ldots, (\tilde{x}^{(n)}, y^{(n)})$ rather than on the original data $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$. The resulting classifier will be linear in the coordinate system of $\tilde{x}$ but non-linear in the coordinate system of $x$. Such feature engineering is often used to build non-linear classifiers using linear classifier implementations.
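As a concrete illustration of the feature expansion just described, here is a small sketch (the function name is ours, not from the original notes) that maps a two-dimensional input to the six-dimensional representation $\tilde{x} = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$.

```python
import numpy as np

def expand_features(x):
    """Map a 2-d vector (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# A classifier that is linear in the expanded coordinates is quadratic
# (hence non-linear) in the original (x1, x2) coordinates.
print(expand_features(np.array([2.0, -1.0])))   # -> [ 1.  2. -1.  4.  1. -2.]
```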

3 Probabilistic Classifiers and Maximum Likelihood

The above definition of a classifier defines a map from a vector of features $x$ to a $+1$ or $-1$ label. It is useful to also have a measure of confidence in that prediction, as well as a way to interpret that confidence in an objective manner. Probabilistic classifiers provide that tool by defining the probabilities of the labels $+1$ and $-1$ given the feature vector $x$. Recall that the two probabilities must be non-negative and sum to one. The probabilities of the two labels given a feature vector are written as $p_\theta(Y = 1 \mid X = x)$ and $p_\theta(Y = -1 \mid X = x)$, where $\theta$ is a parameter vector that defines the classifier. Statistical theory strongly suggests that probabilistic classifiers should be learned from the labeled vectors $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ using the maximum likelihood estimator (MLE)[1], defined below:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; p_{\theta}(Y = y^{(1)} \mid X = x^{(1)}) \cdots p_{\theta}(Y = y^{(n)} \mid X = x^{(n)}) \qquad (1)$$
$$= \arg\max_{\theta} \; \log p_{\theta}(Y = y^{(1)} \mid X = x^{(1)}) + \cdots + \log p_{\theta}(Y = y^{(n)} \mid X = x^{(n)}). \qquad (2)$$

There are several reasons for using the MLE: (a) it converges to the optimal solution in the limit of large data[2], and (b) that convergence occurs at the fastest possible rate.

[1] Unless you are a Bayesian, in which case you do not believe in a single classifier representing the truth but rather in a collection of classifiers, each correct with some probability.
[2] Assuming the data is generated from the logistic regression model family and $n \to \infty$ while $d$ is fixed.

4 Logistic Regression

There are many probabilistic classifiers, but the most popular one is logistic regression, whose probabilistic definition is given below:

$$p_\theta(Y = y \mid X = x) = \frac{1}{1 + \exp(-y \langle \theta, x \rangle)},$$

where $y$ is either $+1$ or $-1$. We can verify that $p_\theta(Y = 1 \mid X = x) + p_\theta(Y = -1 \mid X = x) = 1$:

$$\frac{1}{1 + \exp(-\langle \theta, x \rangle)} + \frac{1}{1 + \exp(\langle \theta, x \rangle)} = \frac{\bigl(1 + \exp(\langle \theta, x \rangle)\bigr) + \bigl(1 + \exp(-\langle \theta, x \rangle)\bigr)}{\bigl(1 + \exp(-\langle \theta, x \rangle)\bigr)\bigl(1 + \exp(\langle \theta, x \rangle)\bigr)} \qquad (3)$$
$$= \frac{2 + \exp(\langle \theta, x \rangle) + \exp(-\langle \theta, x \rangle)}{2 + \exp(\langle \theta, x \rangle) + \exp(-\langle \theta, x \rangle)} = 1. \qquad (4)$$

Recall that if $p_\theta(Y = 1 \mid x) > 0.5$ we have greater confidence in the label $+1$ than in $-1$, and vice versa if $p_\theta(Y = 1 \mid x) < 0.5$. Thus, the decision boundary is precisely the set of vectors $x$ for which $p_\theta(Y = 1 \mid x) = p_\theta(Y = -1 \mid x) = 0.5$. This occurs when

$$\frac{1}{2} = \frac{1}{1 + \exp(-\langle \theta, x \rangle)} \qquad (5)$$
$$1 + \exp(-\langle \theta, x \rangle) = 2 \qquad (6)$$
$$\langle \theta, x \rangle = \log(1) = 0. \qquad (7)$$

In other words, the decision boundary, i.e., the set of vectors where the logistic regression classifier is unsure how to classify, is precisely the set of vectors $x$ for which $\langle \theta, x \rangle = 0$, which implies that logistic regression is a linear classifier whose decision boundary is the hyperplane perpendicular to the vector $\theta$. If we want to predict the label associated with a feature vector $x$ we can use the prediction rule $\mathrm{sign}(\langle \theta, x \rangle)$, but if we want a measure of confidence in that prediction, interpreted as a probability, we can use $p_\theta(Y = y \mid X = x) = 1/(1 + \exp(-y \langle \theta, x \rangle))$.
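To make the probabilistic prediction concrete, here is a minimal sketch (assuming NumPy; the function names are ours) that computes both the hard label $\mathrm{sign}(\langle \theta, x \rangle)$ and the logistic confidence $p_\theta(Y = y \mid X = x)$.

```python
import numpy as np

def predict(theta, x):
    """Hard prediction: the sign of <theta, x>."""
    return 1 if np.dot(theta, x) > 0 else -1

def predict_proba(theta, x, y):
    """Logistic regression confidence p(Y = y | X = x) = 1 / (1 + exp(-y <theta, x>))."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(theta, x)))

theta = np.array([0.5, -1.0, 2.0])   # illustrative parameter vector
x = np.array([1.0, 0.0, 0.3])        # illustrative feature vector
print(predict(theta, x))             # -> 1
print(predict_proba(theta, x, +1))   # ~0.75: confidence in the label +1
print(predict_proba(theta, x, -1))   # ~0.25: the two probabilities sum to 1
```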

5 Solving the MLE using Iterative Optimization

Substituting the logistic regression formula for $p_\theta(Y = y \mid X = x)$ in (2), we get the following optimization problem:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log \frac{1}{1 + \exp(-y^{(i)} \langle \theta, x^{(i)} \rangle)} = \arg\max_{\theta} \; -\sum_{i=1}^{n} \log\bigl(1 + \exp(-y^{(i)} \langle \theta, x^{(i)} \rangle)\bigr) = \arg\min_{\theta} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y^{(i)} \langle \theta, x^{(i)} \rangle)\bigr). \qquad (8)$$

To find the vector $\theta$ that minimizes the above expression, the method of gradient descent is typically used: (a) initialize the dimensions of $\theta$ to random values, (b) for $j = 1, \ldots, d$, update

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y^{(i)} \langle \theta, x^{(i)} \rangle)\bigr),$$

and (c) repeat step (b) until the updates become smaller than a threshold. The parameter $\alpha$ is the step size, and it is customary to let $\alpha$ decay as the gradient descent iterations increase. When the data vectors $x^{(i)}$ are sparse, the above computation of the partial derivative can be made particularly fast. Since (8) is convex in $\theta$, gradient descent will reach the single global optimum regardless of the starting point (though that optimum could be at $\theta_j \to \pm\infty$ for some $j$).

Gradient descent or its variations work effectively for non-massive data. When the data is very big, the method of stochastic gradient descent is typically used: (a) initialize the dimensions of the vector $\theta$ to random values, (b) pick one labeled data vector $(x^{(i)}, y^{(i)})$ at random and update, for each $j = 1, \ldots, d$,

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} \log\bigl(1 + \exp(-y^{(i)} \langle \theta, x^{(i)} \rangle)\bigr),$$

and (c) repeat step (b) until the updates of the dimensions of $\theta$ become too small.

6 High Dimensions, Overfitting, and Regularization

When $x$ is high dimensional, the issue of overfitting comes up: there are too many parameters to estimate from the available labeled data, resulting in a trained classifier that fits random noise patterns in the data rather than the rule governing the likely behavior. For example, consider an extreme case (for illustration purposes) of $d = 10^6$ and $n = 2$. Obviously we cannot properly learn a million parameters, and how they define the linear decision boundary in a million-dimensional space, from 2 labeled data vectors.

A way to handle this situation is to add a regularization term to the maximum likelihood cost function (2): $\beta(\theta_1^2 + \cdots + \theta_d^2)$ or $\beta(|\theta_1| + \cdots + |\theta_d|)$. Such a term penalizes high values of $\theta$ and tempers the MLE in order to prevent it from reaching high values of $\theta_1, \ldots, \theta_d$ unless there is strong evidence in the data for it. The selection of $\beta$ should be done carefully by experimentation; in fact, the best value of $\beta$ depends on the situation at hand, and in particular (though not exclusively) on the relationship between $d$ and $n$. It is customary to try different values of $\beta$, for example $\beta = 0.01, 0.1, 0.5, 1, 5, 10, 100$, train the classifier for each of these values, and then select the classifier that achieves the best performance on a separate held-out set of labeled vectors. The grid of possible values should be adjusted to make sure that we find a $\beta$ close to the optimal value.
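The following sketch ties these two sections together: batch gradient descent on the objective in (8) with an added L2 penalty $\beta \lVert\theta\rVert^2$. It is a minimal illustration assuming NumPy, with function names and toy data of our own choosing, not a production implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, beta=0.1, alpha=0.1, n_iters=1000):
    """Minimize sum_i log(1 + exp(-y_i <theta, x_i>)) + beta * ||theta||^2
    by batch gradient descent. X has shape (n, d); y holds +1/-1 labels."""
    n, d = X.shape
    theta = np.zeros(d)                      # small random values also work
    for _ in range(n_iters):
        margins = y * (X @ theta)            # y_i <theta, x_i>
        # gradient of the data term: -sum_i sigmoid(-margin_i) * y_i * x_i
        grad = -(X * (y * sigmoid(-margins))[:, None]).sum(axis=0) + 2 * beta * theta
        theta -= (alpha / n) * grad          # a stochastic variant uses one random (x_i, y_i) per step
    return theta

# Toy usage on synthetic data; in practice beta would be chosen on a held-out set
# by trying, e.g., beta in {0.01, 0.1, 0.5, 1, 5, 10, 100}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
theta_hat = train_logistic_regression(X, y, beta=0.1)
print(theta_hat, np.mean(np.sign(X @ theta_hat) == y))   # learned parameters, training accuracy
```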

Notes

The Shiny app (a web app embedding R code) at https://tonyfischetti.shinyapps.io/interactivelogisticregression shows how logistic regression can be used to define non-linear classifiers in the original space. The tutorial at http://nbviewer.ipython.org/gist/vietjtnguyen/6655020 shows Python code implementing logistic regression and visualizing the results and the training process.