Applied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson


overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

reformulating the perceptron a bit more compactly: the perceptron algorithm can be expressed more compactly if we code the positive class and the negative class as +1 and -1, respectively; for instance >50K → +1 and <=50K → -1

misclassifications and updates, more compactly: if y_i = +1 (positive), we have a misclassification if w · x_i is negative, i.e. y_i · score ≤ 0, and the update is w = w + y_i · x_i; if y_i = -1 (negative), we have a misclassification if w · x_i is positive, again y_i · score ≤ 0, and the update is w = w + y_i · x_i

the perceptron again, a bit more compactly: if the y_i are coded as +1 or -1:
    w = (0, ..., 0)
    for (x_i, y_i) in the training set:
        score = w · x_i
        if y_i · score ≤ 0:
            w = w + y_i · x_i
    return w
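as a rough illustration, here is a minimal NumPy sketch of this compact perceptron; the function name, the multi-epoch loop, and the parameter n_epochs are just assumptions for this example (the pseudocode above shows a single pass over the training set):

    import numpy as np

    def perceptron(X, Y, n_epochs=10):
        # X: (m, n) array of feature vectors, Y: (m,) array of labels coded as +1 / -1
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, Y):
                score = w.dot(x_i)
                if y_i * score <= 0:        # misclassified (or exactly on the boundary)
                    w = w + y_i * x_i       # the same update handles both classes
        return w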

how can we get the confidence of a linear classifier?

how can we get the confidence of a linear classifier? score = w · x; a large positive score gives strong support that x belongs to the positive class, a large negative score gives strong support that x belongs to the negative class, and a score near zero means we are unsure

prediction score in scikit-learn:
    clf = ...  (train a classifier)
    scores = clf.decision_function(X)
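for instance, a minimal sketch on a synthetic toy dataset (the dataset and the choice of Perceptron are just assumptions for this example):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import Perceptron

    X, y = make_classification(n_samples=100, n_features=5, random_state=0)
    clf = Perceptron().fit(X, y)           # train a linear classifier
    scores = clf.decision_function(X)      # one score per instance; its sign decides the class
    print(scores[:5])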

overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

how to interpret the output scores? linear classifiers select the output based on a scoring function: score = w · x; the confidence isn't directly interpretable; can we create a model where the output can be interpreted as a probability?

the logistic regression model: logistic regression is a method to train a linear classifier that gives a probabilistic output; how do we get the probability? use a logistic or sigmoid function: P(positive output | x) = 1 / (1 + e^(-score)), where e^(-score) = np.exp(-score); this is formally a probability: always between 0 and 1, and the probabilities of the possible outcomes sum to 1

[figure: the logistic / sigmoid function, P(y = positive | x) plotted against the classifier score]

conversely, P(negative output | x) = 1 - 1 / (1 + e^(-score)) = 1 / (1 + e^(score))

making it a bit more compact: if we code the positive class as +1 and the negative class as -1, then we can write the probability a bit more neatly: P(y | x) = 1 / (1 + e^(-y · score))

in scikit-learn, LR is called sklearn.linear_model.LogisticRegression; predict_proba gives the probability output

code example: using a logistic regression classifier
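the demo itself is not in the transcription; a minimal sketch along these lines, assuming a synthetic toy dataset:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X[:3]))             # predicted class labels
    print(clf.predict_proba(X[:3]))       # probability of each class, one row per instance
    print(clf.decision_function(X[:3]))   # the raw scores w · x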

overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

recall: the maximum likelihood principle: in a probabilistic model, we can train the model by selecting parameters that assign a high probability to the data; in our case, the parameters are the weight vector w, so we adjust w so that each output label gets a high probability

the likelihood function: formally, the probability of the data is defined by the likelihood function; this is the product of the probabilities of all m individual training instances: L(w) = P(y_1 | x_1) · ... · P(y_m | x_m); in our case, this means L(w) = 1 / (1 + e^(-y_1 (w · x_1))) · ... · 1 / (1 + e^(-y_m (w · x_m)))

rewriting a bit... we rewrite the previous formula L(w) = 1 / (1 + e^(-y_1 (w · x_1))) · ... · 1 / (1 + e^(-y_m (w · x_m))) as -log L(w) = Loss(w, x_1, y_1) + ... + Loss(w, x_m, y_m), where Loss(w, x, y) = log(1 + exp(-y · (w · x))) is called the log loss function
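a direct NumPy translation of this loss could look as follows (the function name is just illustrative):

    import numpy as np

    def log_loss(w, x, y):
        # log loss for a single training instance; y is coded as +1 or -1
        return np.log(1 + np.exp(-y * w.dot(x)))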

[figure: plot of the log loss as a function of y * classifier score]

recall: the fundamental tradeoff in machine learning: goodness of fit (the learned classifier should be able to correctly classify the examples in the training data) versus regularization (the classifier should be simple); but so far in our LR description, we've just taken care of the first part! -log L(w) = Loss(w, x_1, y_1) + ... + Loss(w, x_m, y_m)

regularization in logistic regression models: just like we saw for linear regression models (Ridge and Lasso), we can add a regularizer that keeps the weights small; most commonly the L2 regularizer ||w||^2 = w_1 · w_1 + ... + w_n · w_n = w · w, or an L1 regularizer ||w||_1 = |w_1| + ... + |w_n|, which will do some feature selection

combining the pieces: we combine the loss and the regularizer: Σ_i Loss(w, x_i, y_i) + λ · ||w||^2; in this formula, λ is a tweaking parameter that controls the tradeoff between loss and regularization; note: in some formulations (including scikit-learn), there is a parameter C that is put before the loss instead of the λ: C · Σ_i Loss(w, x_i, y_i) + ||w||^2
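written as code, the C-style objective could be sketched like this (a hypothetical helper for illustration, not part of any library):

    import numpy as np

    def lr_objective(w, X, Y, C=1.0):
        # C * (sum of log losses over the training set) + squared L2 norm of w
        losses = np.log(1 + np.exp(-Y * X.dot(w)))
        return C * losses.sum() + w.dot(w)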

check: minimize over w: C · Σ_i Loss(w, x_i, y_i) + ||w||^2; how do we convert this into an algorithm?

overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

two-class (binary) linear classifiers: a linear classifier is a classifier defined in terms of a scoring function score = w · x; this is a binary (2-class) classifier: return the first class if the score > 0, otherwise the second class; how can we deal with non-binary (multi-class) problems when using linear classifiers?

decomposing multi-class classification problems idea 1: break down the complex problem into simpler problems, train a classifier for each separately

decomposing multi-class classification problems, idea 1: break down the complex problem into simpler problems and train a classifier for each separately. one-versus-rest ("long jump"): for each class c, make a binary classifier to distinguish c from all other classes; so if there are n classes, there are n classifiers, and at test time we select the class giving the highest score. one-versus-one ("football league"): for each pair of classes c_1 and c_2, make a classifier to distinguish c_1 from c_2; if there are n classes, there are n · (n - 1) / 2 classifiers, and at test time we select the class that has the most wins.

example: assume we're training a classifier of fruits and we have the classes apple, orange, and mango; in one-vs-rest, we train the following three classifiers: apple vs orange+mango, orange vs apple+mango, mango vs apple+orange; in one-vs-one, we train the following three: apple vs orange, apple vs mango, orange vs mango

example (continued): we train classifiers to distinguish between apple, orange, and mango using one-vs-rest, so we get w_apple, w_orange, w_mango; for some instance x, the respective scores are [-1, 2.2, 1.5], so our guess is orange

in scikit-learn: scikit-learn includes implementations of both of the methods we have discussed, OneVsRestClassifier and OneVsOneClassifier; however, the built-in algorithms (e.g. Perceptron, LogisticRegression) will do this automatically for you, and they use one-versus-rest
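for example, wrapping a binary learner explicitly might look like this (a sketch on a synthetic three-class dataset; the dataset and the choice of LinearSVC are assumptions for this example):

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)

    ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # trains 3 binary classifiers
    ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # trains 3 * 2 / 2 = 3 pairwise classifiers
    print(ovr.predict(X[:5]))
    print(ovo.predict(X[:5]))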

multiclass learning algorithms: is it good to separate the multiclass task into smaller tasks that are trained independently? maybe training should be similar to testing? idea 2: make a model where the selection over all the classes (as in one-vs-rest prediction) is part of training; let's see how this can be done for logistic regression

binary LR: reminders. the logistic or sigmoid function: P(positive output | x) = 1 / (1 + e^(-score))

    def sigmoid(score):
        return 1 / (1 + np.exp(-score))

when training, we minimize the log loss: Loss(w, x, y) = log(1 + exp(-y · (w · x)))

multiclass LR using the softmax: the softmax function is used in multiclass LR instead of the logistic: P(y_i | x) = e^(score_i) / Σ_k e^(score_k)

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

[exercise: make softmax numerically stable]

softmax example:

    import numpy as np

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

    scores = [-1, 2.2, 1.5, -0.3]
    print(softmax(scores))
    array([ 0.02517067, 0.61750026, 0.30664156, 0.05068751])

[figure: bar chart of the four softmax probabilities]

cross-entropy loss: when training, the softmax probabilities lead to the cross-entropy loss instead of the log loss: Loss_CE(w, x_i, y_i) = -log P(y_i | x_i) = -log ( e^(score_{y_i}) / Σ_k e^(score_k) ); just like the log loss: high probability for the correct label y_i → low loss, low probability for y_i → high loss
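using the softmax from the previous slide, the cross-entropy loss for one instance could be sketched as follows (the function name and the index-based interface are assumptions for this example):

    import numpy as np

    def cross_entropy_loss(scores, correct_class):
        # negative log of the softmax probability assigned to the correct class
        expscores = np.exp(scores)            # not numerically stabilized, as in the slides
        probs = expscores / expscores.sum()
        return -np.log(probs[correct_class])

    print(cross_entropy_loss(np.array([-1, 2.2, 1.5, -0.3]), 1))   # low loss: class 1 gets high probability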

multiclass LR in scikit-learn: LogisticRegression(multi_class='multinomial') (otherwise, separate classifiers are trained independently)

overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

geometric view: geometrically, a linear classifier can be interpreted as separating the vector space into two regions using a line (plane, hyperplane) [figure: a linear separator in two dimensions]

margin of separation: the margin γ denotes how well w separates the classes: [two figures illustrating the margin] γ is the shortest distance from the separator to the nearest training instance

large margins are good: a result from statistical learning theory: true error ≤ training error + BigUglyFormula(1/γ^2); larger margin → better generalization

support vector machines: support vector machines (SVM) or support vector classifiers (SVC) are linear classifiers constructed by selecting the w that maximizes the margin [figure: the maximum-margin separator]

soft-margin SVMs: in some cases the dataset is inseparable, or nearly inseparable; soft-margin SVM: allow some examples to be disregarded when maximizing the margin [figure, two panels: A) Hard Margin SVM, B) Soft Margin SVM with slack variables ξ_i]

stating the SVM as an objective function: the hard-margin and soft-margin SVM can be stated mathematically in a number of ways; we'll skip the details, but with a bit of work we can show that the soft-margin SVM can be stated as minimizing C · Σ_i Loss(w, x_i, y_i) + ||w||^2, where Loss(w, x, y) = max(0, 1 - y · (w · x)) is called the hinge loss
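as a quick sketch, mirroring the log loss helper earlier (the function name is illustrative):

    import numpy as np

    def hinge_loss(w, x, y):
        # hinge loss for a single instance; y is coded as +1 or -1
        return max(0.0, 1 - y * w.dot(x))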

[figure: plot of the hinge loss as a function of y * classifier score]

in scikit-learn, linear SVM is called sklearn.svm.LinearSVC

overview: preliminaries; logistic regression; training a logistic regression classifier; side note: multiclass linear classifiers; support vector classification; optimizing the LR and SVM objectives

SVM and LR have convex objective functions [figures illustrating the convexity of the SVM and LR objectives]

optimizing SVM and LR: since the objective functions of SVM and LR are convex, we can find w by stochastic gradient descent. pseudocode:
    set w to some initial value, e.g. all zeros
    iterate a fixed number of times:
        select a single training instance x
        select a suitable step length η
        compute the gradient of the hinge loss or log loss
        subtract step length · gradient from w
note the similarity to the perceptron!

missing pieces: setting the learning rate η; gradients for the SVM and LR loss functions (hinge loss and log loss)

setting the learning rate η: in principle, you can try to select a small enough value of η; in practice, it's better to decrease η gradually; we'll use the Pegasos algorithm, where we set η = C / t, where t is the current step (1, 2, ...) and C is the loss/regularization tradeoff
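a minimal sketch of the SGD loop above with the hinge loss and a decreasing step length (the function name and n_epochs are assumptions for this example; as in the pseudocode, the regularizer's gradient is left out, so this is not the full Pegasos algorithm used in the assignment):

    import numpy as np

    def hinge_sgd(X, Y, C=1.0, n_epochs=10):
        # X: (m, n) feature matrix, Y: (m,) labels coded as +1 / -1
        w = np.zeros(X.shape[1])
        t = 0
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, Y):
                t += 1
                eta = C / t                    # decreasing step length
                if y_i * w.dot(x_i) < 1:       # the hinge loss is nonzero for this instance
                    # the gradient of the hinge loss w.r.t. w is -y_i * x_i, so
                    # subtracting eta * gradient amounts to adding eta * y_i * x_i
                    w = w + eta * y_i * x_i
        return w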

some comments about assignment 2: implement SVM and LR and test them in a classifier; Pegasos (which is just SGD) works in an iterative fashion similar to the perceptron, so if you start from my perceptron code this will be a breeze; there are optional tasks to speed up the implementation using sparse vectors

next couple of weeks:
    Thursday: lab session for assignment 1
    Friday: evaluation methods
    Tuesday: no class! (CHARM) http://charm.chalmers.se/
    Wednesday: deadline for assignment 1
    Thursday: lab session for assignment 2
    Friday: guest lecture (Ericsson)