ECE521 Lecture 7. Logistic Regression


Outline: Review of decision theory; Logistic regression; A single neuron; Multi-class classification.

Outline: Decision theory is conceptually easy and computationally hard. Learning objectives: optimizing the misclassification rate -> choosing the action with the largest P(C|x); developing intuition about misclassification on the p(x, C) graph; why expected loss is a good idea -> it captures asymmetry in our decisions.

Decision theory. We would like to say something formal about why we prefer certain actions over others, and to have a framework that helps us decide on the optimal action. In classification, the action is choosing a class label given the inputs. We again turn to probability theory and assume we have the joint distribution P(x, C).

Decision theory: misclassification rate. One possible decision rule (a simple intuition): the best action for the current input x is to choose the class label C_j with the highest conditional probability P(C_j|x) among all the labels. [Figure: a decision boundary splits the input space into region R_1, where we classify as C_1, and region R_2, where we classify as C_2.] We can define a framework under which choosing the most probable class P(C_j|x) is the optimal decision rule. Define the misclassification rate as P(mistake) and consider a binary classification task: a mistake occurs when x in R_1 actually belongs to C_2, or when x in R_2 actually belongs to C_1.
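Written out (a standard reconstruction consistent with the two error cases just listed; the integral form itself did not survive the transcription):

P(\text{mistake}) = P(x \in R_1, C_2) + P(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx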

Decision theory: misclassification rate. Given the joint distribution of the inputs and the label, P(x, C), we can plot p(x, C_1) and p(x, C_2). The goal is to slide the decision boundary and adjust the regions R so as to minimize the misclassification rate. [Figure: p(x, C_1) and p(x, C_2) plotted over x, with the current and the optimal decision boundary marked; the misclassification rate = (area of green + area of red) + area of blue.]

Decision theory: misclassification rate. Given the joint distribution of the inputs and the label, P(x, C), we can plot p(x, C_1) and p(x, C_2). The goal is to slide the decision boundary and adjust the regions R so as to minimize the misclassification rate. [Figure: the same plot with the boundary moved to the optimal decision boundary.] Intuition: the overlapping area (green + blue) under the joint distribution P(x, C) is the minimum misclassification rate (the red region vanishes). This corresponds to picking R_1 such that p(x, C_1) > p(x, C_2), and since p(x, C_j) = P(C_j|x) p(x), that is the same as picking the class with the largest P(C_j|x). So we recover our intuitive decision rule: minimizing the misclassification rate is equivalent to picking the most probable class.

Decision theory: loss matrix. However, not all mistakes are the same. Misclassifying a cancer patient as healthy (a false negative) can be devastating; it is far worse than misclassifying a healthy patient as positive (a false positive). Conversely, for smart digital cameras, detecting non-faces as faces (false positives) can be extremely frustrating, while false negatives are generally more tolerable in this case. We need a more general framework to weigh the importance of different kinds of mistakes for a given application.

Decision theory: loss matrix. We can integrate over all the decision regions under the joint distribution P(x, C) and weight each kind of mistake by an entry of a loss matrix L. This defines the expected loss. In the cancer example, C_1 = cancer and C_2 = normal; because false negatives are really bad, we give the false-negative entry of the loss matrix a much larger weight than the false-positive entry.
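The expected-loss formula did not survive the transcription; the standard form, together with an illustrative loss matrix for the cancer example (the value 1000 is my placeholder, not taken from the slide), is:

\mathbb{E}[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\,dx,
\qquad
L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix},

with rows indexed by the true class (cancer, normal) and columns by the predicted class, so the false-negative entry L_{12} is weighted far more heavily than the false-positive entry L_{21}.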

Decision theory: loss matrix. The misclassification rate is a special case of the expected loss in which all the off-diagonal terms of the loss matrix are 1 and the diagonal terms are 0. Under this choice, the misclassification rate weights false positives and false negatives equally, which is undesirable in many applications.

Decision theory: loss matrix. Interesting facts: the loss matrix is also called the risk or utility function. One of the challenges of decision theory is figuring out what the loss matrix or utility function should be. It is known in economics that the utility of money is notoriously nonlinear and even inconsistent: for example, most people choose A over B but at the same time prefer D over C. A: $1 million guaranteed. B: 89% chance of $1 million, 10% chance of $2.5 million, 1% chance of nothing. C: 89% chance of nothing, 11% chance of $1 million. D: 90% chance of nothing, 10% chance of $2.5 million.

Outline (recap): Decision theory is conceptually easy and computationally hard. Learning objectives: optimizing the misclassification rate -> choosing the action with the largest P(C|x); developing intuition about misclassification on the p(x, C) graph; why expected loss is a good idea -> it captures asymmetry in our decisions.

Outline: Review of decision theory; Logistic regression; A single neuron; Multi-class classification.

Outline: Logistic regression. Prerequisites: linear regression, MLE. Learning objectives: the logistic/sigmoid function and its derivatives; the cross-entropy loss function and its derivatives; the probabilistic interpretation (assignment 2).

Linear regression. Given a set of training data and their real-valued outputs, linear regression models the output as a weighted sum of the inputs: y = w^T x + b. This is also known as linear filtering in signal processing.

Linear regression. The weight vector learns the linear trend/slope of the training data. If the input x has zero mean, the bias unit learns the average output value.
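A minimal NumPy sketch of these two points (my own illustration; the synthetic data and variable names are not from the lecture): fit w and the bias b by least squares on zero-mean inputs, and check that b comes out close to the mean of the targets.

import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                               # make the inputs exactly zero-mean
true_w = np.array([1.5, -2.0, 0.5])
t = X @ true_w + 4.0 + 0.1 * rng.normal(size=N)   # targets with a true bias of 4.0

X1 = np.hstack([X, np.ones((N, 1))])              # append a column of ones for the bias unit
w_b, *_ = np.linalg.lstsq(X1, t, rcond=None)      # least-squares fit of [w, b]
w, b = w_b[:-1], w_b[-1]

print(w)                                          # close to the slope true_w
print(b, t.mean())                                # with zero-mean inputs, b is close to mean(t)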

Linear regression as a classifier. Given a set of training data and their target class labels (positive or negative class) encoded as 1 and 0, you can train a model that produces a score and then threshold the score. Problems: it is not clear how to compute the confidence of a prediction in this setup, and what should the threshold be? Use decision theory to set it?

Linear regression as a classifier. Alternatively, have the model regress the binary class labels directly: we may use a linear regression model to predict 1 and 0 outright. Problems: what does it mean for the model to predict -10 on a test example? The range of the linear regression's prediction is unrestricted: -∞ to +∞.

Logistic regression. A simple solution is to use a logistic/sigmoid function to squash the linear regression output to lie between 0 and 1: y = σ(w^T x + b), where σ(z) = 1 / (1 + e^(-z)). The sigmoid has some nice properties, and its (0, 1) output is perfect for encoding P(t=1|x)!
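The slide's list of properties is not reproduced in the transcription; the properties usually cited, and the ones used later in this lecture, are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad 0 < \sigma(z) < 1, \qquad \sigma(-z) = 1 - \sigma(z), \qquad \frac{d\sigma}{dz} = \sigma(z)\bigl(1 - \sigma(z)\bigr).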

Logistic regression. The sigmoid function can be derived from the log of the ratio of the probabilities of the two classes, also known as the log-odds. If we set the log-odds equal to the linear model, rearrange the terms, and solve for P(t=1|x), we again get the sigmoid function.
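A sketch of that derivation (standard algebra; a = w^T x + b denotes the linear model):

\log \frac{P(t=1 \mid x)}{P(t=0 \mid x)} = a
\;\Longrightarrow\;
\frac{P(t=1 \mid x)}{1 - P(t=1 \mid x)} = e^{a}
\;\Longrightarrow\;
P(t=1 \mid x) = \frac{e^{a}}{1 + e^{a}} = \frac{1}{1 + e^{-a}} = \sigma(a).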

Logistic regression. We now have a new model whose output is a logistic function of the weighted sum of the inputs, i.e. the logistic regression model, y = σ(w^T x + b). The classification prediction is computed by thresholding: predict class 1 if y ≥ 0.5 and class 0 otherwise. Define the training loss function using the squared L2 norm: L = Σ_n (y^(n) - t^(n))^2.

Logistic regression. We now have a new model whose output is a logistic function of the weighted sum of the inputs, i.e. the logistic regression model, y = σ(w^T x + b). The classification prediction is computed by thresholding: predict class 1 if y ≥ 0.5 and class 0 otherwise. Define the training loss function using the squared L2 norm, L = Σ_n (y^(n) - t^(n))^2, and take the gradient w.r.t. w.
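The gradient on the slide is not reproduced; by the chain rule, with z^{(n)} = w^T x^{(n)} + b and \hat{y}^{(n)} = \sigma(z^{(n)}), it works out to

\frac{\partial L}{\partial w} = \sum_n 2\bigl(\hat{y}^{(n)} - t^{(n)}\bigr)\,\sigma'\bigl(z^{(n)}\bigr)\,x^{(n)},
\qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr).

The extra \sigma' factor is what causes the vanishing learning signal discussed on the next slide.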

Logistic regression. Recap: we started by designing a reasonable linear model and chose an L2 loss function to measure the discrepancy between the predicted label and the target class. So take a step back: why did we choose the squared L2 loss in the first place? Analysis is easy: it is differentiable and has a simple gradient. It also has a nice probabilistic interpretation with Gaussian errors: in the linear regression case, maximizing the likelihood of the data P(data) is equivalent to minimizing the sum of squared L2 losses.

Logistic regression. The squared L2 loss has a problem for classification under the sigmoid function. Consider the gradient of the sigmoid function, σ'(z) = σ(z)(1 - σ(z)): the partial derivative is very close to zero away from the origin, i.e. there is very little learning signal once the sigmoid saturates, even when the model is still making mistakes.
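A quick numerical illustration (my own, not from the lecture) of how small σ'(z) becomes away from the origin:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.0, 2.0, 5.0, 10.0])
grad = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid at each z
print(grad)                              # approx [0.25, 0.105, 0.0066, 0.000045]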

Cross-entropy loss function. Cross-entropy is a better loss function for classification. Take the gradient w.r.t. w: the gradient under the cross-entropy loss has the same form as the gradient of the squared loss for the linear regression model!
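The formulas themselves did not survive the transcription; the standard forms, with \hat{y}^{(n)} = \sigma(w^T x^{(n)} + b), are

L_{CE} = -\sum_n \Bigl[ t^{(n)} \log \hat{y}^{(n)} + \bigl(1 - t^{(n)}\bigr) \log\bigl(1 - \hat{y}^{(n)}\bigr) \Bigr],
\qquad
\frac{\partial L_{CE}}{\partial w} = \sum_n \bigl(\hat{y}^{(n)} - t^{(n)}\bigr)\, x^{(n)}.

The \sigma'(z) factor cancels, so the gradient no longer vanishes when the sigmoid saturates on a wrong prediction.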

Learning logistic regression. What is happening during learning? The gradient reflects the correlation between the error and the inputs. In what circumstances is the gradient zero, i.e. when does the model stop learning any new information? Either all the individual gradients are zero (the perfectly separable case), or the gradients from different training examples cancel out (the most likely scenario).

Learning logistic regression. The gradient reflects the correlation between the mistakes and the input features. After learning, the value of each individual weight indicates the importance of its input to the final prediction: if an input feature x_n is positively correlated with the target label, its weight will be a large positive value. [Figure: a single sigmoid (σ) neuron computing the weighted sum of its inputs.]
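To tie the last few slides together, here is a small gradient-descent sketch of logistic regression trained with the cross-entropy gradient (my own illustration; the toy data, learning rate, and variable names are not from the course):

import numpy as np

rng = np.random.default_rng(1)

# Toy binary data: the class is determined (noisily) by the first feature.
N, D = 500, 2
X = rng.normal(size=(N, D))
t = (X[:, 0] + 0.3 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(D)
b = 0.0
lr = 0.1
for _ in range(2000):
    y = sigmoid(X @ w + b)        # predicted P(t=1 | x)
    err = y - t                   # cross-entropy gradient w.r.t. the logit
    w -= lr * (X.T @ err) / N     # gradient step on the weights
    b -= lr * err.mean()          # gradient step on the bias

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
print(w)                          # the weight on feature 0 is large and positive
print((pred == t).mean())         # training accuracy

As the slide suggests, the learned weight on the informative feature ends up large and positive, since that feature is positively correlated with the target label.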

Concepts in the course so far. Problem formulations: i.i.d. data; choosing a loss function (a distance function), e.g. squared L2 loss or cross-entropy loss; MLE and MAP (choose a likelihood function and a prior, then optimize to obtain the model parameters); the weight-decay regularizer. Learning algorithms: gradient descent, stochastic gradient descent, momentum. Models: linear regression; logistic regression; k-NN (no learning required). Some theoretical results provide additional intuition: how to pick the optimal regressor, optimal decision rules (how to set the threshold/decision boundary), expected loss.

Outline (recap): Logistic regression. Learning objectives: the logistic/sigmoid function and its derivatives; the cross-entropy loss function and its derivatives; the probabilistic interpretation (assignment 2).