Logistic Regression. COMP 527 Danushka Bollegala


Binary Classification
Given an instance x, we must classify it into either the positive (1) or the negative (0) class.
We could use {1,-1} instead of {1,0}, but we will use the latter formulation because it simplifies the notation in the subsequent derivations.
Binary classification can be seen as learning a function f such that f(x) returns either 1 or 0, indicating the predicted class.

Some terms in Machine Learning
Training dataset with N instances {(x_1, t_1), ..., (x_N, t_N)}.
Target label (class) t: the class labels in the training dataset, annotated by humans (supervised learning).
Predicted label: the labels predicted by our model f(x).
P(A|B): conditional probability of observing an event A, given an event B.
P(A): marginal probability of event A; we have marginalised out all the variables on which A depends (cf. the margin of a probability table).
Prior probability: P(B).
Posterior probability: P(B|A).

Logistic Regression
is not a regression model
is a classification model
is the basis of many advanced machine learning methods: neural networks, deep learning, conditional random fields, ...
fits a logistic sigmoid function to predict the class labels

Logistic Sigmoid Function
(Figure: plot of the logistic sigmoid over roughly x in [-2.4, 2.4])

$$y = \frac{1}{1 + \exp(-x)}$$
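As a quick numerical illustration (a minimal sketch, not part of the original slides; the function name and sample points are my own), the sigmoid squashes any real-valued score into (0, 1):

import numpy as np

def sigmoid(x):
    # Logistic sigmoid y = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-2.4))  # ~0.083
print(sigmoid(0.0))   # 0.5
print(sigmoid(2.4))   # ~0.917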

Why do we use the logistic sigmoid?
Reason 1: We must squash the prediction score w^T x, which lies in the range (-∞, +∞), into the range (0, 1) when performing binary classification.
Reason 2 (Bayes' rule): Posterior ∝ Conditional × Prior

$$
P(t=1 \mid \mathbf{x})
= \frac{P(\mathbf{x} \mid t=1)\,P(t=1)}{P(\mathbf{x})}
= \frac{P(\mathbf{x} \mid t=1)\,P(t=1)}{P(t=1)\,P(\mathbf{x} \mid t=1) + P(t=0)\,P(\mathbf{x} \mid t=0)}
= \frac{1}{1 + \dfrac{P(t=0)\,P(\mathbf{x} \mid t=0)}{P(\mathbf{x} \mid t=1)\,P(t=1)}}
$$

Defining a by

$$\exp(a) = \frac{P(\mathbf{x} \mid t=1)\,P(t=1)}{P(t=0)\,P(\mathbf{x} \mid t=0)}$$

we obtain

$$P(t=1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

Likelihood
We have a probabilistic model (the logistic sigmoid function σ(w^T x)) that tells us the probability of a particular training instance x being positive (t=1) or negative (t=0).
We can use this model to compute the probability of the entire training dataset: the likelihood of the training dataset.
However, this dataset is already observed (we have it with us).
If our model is to explain this training dataset, then it must maximise the likelihood of this dataset (more than that of any other labelling of the dataset).
This is the Maximum Likelihood Estimate/Principle (MLE).

Maximum Likelihood Estimate

$$y_n = \sigma(\mathbf{w}^\top \mathbf{x}_n) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x}_n)}, \qquad \mathbf{t} = (t_1, \ldots, t_N)^\top$$

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$

By taking the negative of the logarithm of the above product we define the cross-entropy error function (Q1):

$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}$$

By differentiating E(w) w.r.t. w we get ∇E(w) as follows (Q2):

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \mathbf{x}_n$$
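To make the two formulas above concrete, here is a minimal NumPy sketch (not part of the original slides; the toy data, shapes, and function names are illustrative assumptions) that evaluates the cross-entropy error E(w) and its gradient ∇E(w):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    y = sigmoid(X @ w)                      # y_n = sigma(w^T x_n), shape (N,)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    # grad E(w) = sum_n (y_n - t_n) x_n
    y = sigmoid(X @ w)
    return X.T @ (y - t)                    # shape (D,)

# Toy dataset: N=4 instances with D=2 features, labels in {0, 1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, 0, 0])
w = np.zeros(2)
print(cross_entropy(w, X, t), gradient(w, X, t))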

Q1: Derivation of the Cross-Entropy Error Function
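A sketch of the derivation (left as an exercise in the slides): take the negative logarithm of the likelihood and use the fact that the logarithm of a product is a sum of logarithms.

$$
\begin{aligned}
E(\mathbf{w}) &= -\ln p(\mathbf{t} \mid \mathbf{w}) = -\ln \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} \\
&= -\sum_{n=1}^{N} \ln\!\left( y_n^{t_n} (1 - y_n)^{1 - t_n} \right) \\
&= -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}
\end{aligned}
$$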

Q2: Derivation of the Gradient
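A sketch of the derivation (also an exercise): apply the chain rule and the sigmoid derivative σ'(a) = σ(a)(1 - σ(a)), which gives ∇y_n = y_n (1 - y_n) x_n.

$$
\begin{aligned}
\nabla E(\mathbf{w}) &= -\sum_{n=1}^{N} \left\{ \frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n} \right\} y_n (1 - y_n)\, \mathbf{x}_n \\
&= -\sum_{n=1}^{N} \left\{ t_n (1 - y_n) - (1 - t_n) y_n \right\} \mathbf{x}_n
= \sum_{n=1}^{N} (y_n - t_n)\, \mathbf{x}_n
\end{aligned}
$$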

Updating the weight vector
Generic update rule:

$$\mathbf{w}^{(r+1)} = \mathbf{w}^{(r)} - \eta \nabla E(\mathbf{w})$$

Update rule with the cross-entropy error function (for a single instance n):

$$\mathbf{w}^{(r+1)} = \mathbf{w}^{(r)} - \eta (y_n - t_n)\, \mathbf{x}_n$$

Logistic Regression Algorithm
Given a set of training instances {(x_1, t_1), ..., (x_N, t_N)}, a learning rate η, and a number of iterations T:
  Initialise the weight vector w = 0
  For j in 1, ..., T:
    For n in 1, ..., N:
      if pred(x_n) ≠ t_n:  # misclassification
        w ← w - η (y_n - t_n) x_n
  Return the final weight vector w
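A minimal NumPy sketch of this loop as stated on the slide (updating only on misclassified instances, exactly as written above); the function and variable names are my own, and pred follows the prediction rule given on the next slide.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pred(w, x):
    # Prediction rule from the next slide: 1 if w^T x > 0, else 0
    return 1 if w @ x > 0 else 0

def train_logistic_regression(X, t, eta=0.1, T=100):
    N, D = X.shape
    w = np.zeros(D)                       # initialise w = 0
    for j in range(T):                    # T passes over the data
        for n in range(N):                # one instance at a time (online/SGD)
            if pred(w, X[n]) != t[n]:     # update only on a misclassification
                y_n = sigmoid(w @ X[n])
                w = w - eta * (y_n - t[n]) * X[n]
    return w

# Toy usage (illustrative data)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, 0, 0])
w = train_logistic_regression(X, t)
print(w, [pred(w, x) for x in X])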

Prediction Function pred
Given the weight vector w, pred returns the class label for an instance x:
  if w^T x > 0: predicted label = 1  # positive class
  else:         predicted label = 0  # negative class

Online vs. Batch Logistic Regression
The algorithm we discussed in the previous slides is an online algorithm because it considers only one instance at a time when updating the weight vector. This is referred to as the Stochastic Gradient Descent (SGD) update.
In the batch version, we compute the cross-entropy error over the entire training dataset and then update the weight vector.
A popular optimisation algorithm for batch learning of logistic regression is the Limited-Memory BFGS (L-BFGS) algorithm.
The batch version is slow compared to the SGD version, but it shows slightly improved accuracies in many cases.
The SGD version can require multiple iterations over the dataset before it converges (if it converges at all).
SGD is frequently used for large-scale machine learning tasks (even when the objective function is non-convex).
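For contrast with the online loop above, here is a hedged sketch of batch training that minimises the full-dataset cross-entropy with SciPy's L-BFGS implementation. This particular call is my own illustration, not the course's reference implementation, and the toy data and clipping constant are assumptions.

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, t):
    y = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)   # clip for numerical stability
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    return X.T @ (sigmoid(X @ w) - t)

# Toy data (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, 0, 0])

# Batch learning: minimise E(w) over the whole dataset with L-BFGS,
# supplying the gradient derived on the earlier slide.
result = minimize(
    fun=cross_entropy,
    x0=np.zeros(X.shape[1]),      # initialise w = 0
    args=(X, t),
    jac=gradient,
    method="L-BFGS-B",
)
print(result.x)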

Regularisation
Regularisation: reducing overfitting in a model by constraining it (reducing its complexity / number of parameters).
For classifiers that use a weight vector, regularisation can be done by minimising the norm (length) of the weight vector.
Several popular regularisation methods exist:
L2 regularisation (ridge regression or Tikhonov regularisation)
L1 regularisation (Lasso regression)
L1+L2 regularisation (mixed regularisation)

L2 regularisation
Let us denote the loss of classifying a dataset D using a model represented by a weight vector w by L(D, w), and suppose we would like to impose L2 regularisation on w. The overall objective to minimise can then be written as follows (here λ is called the regularisation coefficient and is set via cross-validation):

$$L(D, \mathbf{w}) + \lambda\, \mathbf{w}^\top \mathbf{w}$$

The gradient of the overall objective is simply the loss gradient plus the scaled weight vector:

$$\nabla L(D, \mathbf{w}) + 2\lambda \mathbf{w}$$

Examples
Note that the SGD update for minimising a loss multiplies the loss gradient by the negative learning rate (-η). Therefore, the L2 regularised update rules gain an extra -2ηλw term, as shown in the following examples.
L2 regularised Perceptron update (for a misclassified instance): the usual Perceptron update with the additional -2ηλw term.
L2 regularised logistic regression update:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta (y_n - t_n)\, \mathbf{x}_n - 2\eta\lambda \mathbf{w}$$
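A minimal sketch of the L2 regularised logistic regression step (the function name is mine and lam stands for λ); it can be dropped into the earlier SGD loop in place of the unregularised update.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularised_update(w, x_n, t_n, eta=0.1, lam=0.01):
    # One L2 regularised step: w <- w - eta*(y_n - t_n)*x_n - 2*eta*lam*w
    y_n = sigmoid(w @ x_n)
    return w - eta * (y_n - t_n) * x_n - 2.0 * eta * lam * w

# Illustrative single step
w = l2_regularised_update(np.zeros(2), np.array([1.0, 2.0]), 1)
print(w)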

How to set λ
Split your training dataset into training and validation parts (e.g. 80%-20%).
Try different values for λ (typically on a logarithmic scale). Train a different classification model for each λ and select the value that gives the best performance (e.g. accuracy) on the validation data.
λ = 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5
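A hedged sketch of this procedure using scikit-learn (referenced on the next slide); note that sklearn's LogisticRegression exposes the regularisation strength as C, roughly the inverse of λ, and the synthetic data below is purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative data: 200 random instances with 5 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
t = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200) > 0).astype(int)

# 80%-20% train/validation split, as suggested on the slide
X_tr, X_val, t_tr, t_val = train_test_split(X, t, test_size=0.2, random_state=0)

best_lam, best_acc = None, -1.0
for lam in [10.0 ** k for k in range(-5, 6)]:     # lambda = 1e-5 ... 1e5
    clf = LogisticRegression(C=1.0 / lam)         # sklearn uses C ~ 1 / lambda
    clf.fit(X_tr, t_tr)
    acc = accuracy_score(t_val, clf.predict(X_val))
    if acc > best_acc:
        best_lam, best_acc = lam, acc
print(best_lam, best_acc)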

References
Bishop, Pattern Recognition and Machine Learning, Section 4.3.2
Software
scikit-learn (Python): http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Classias (C): http://www.chokkan.org/software/classias/