CSC 411: Lecture 03: Linear Classification


Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto

Examples of Problems

What digit is this? How can I predict this? What are my input features?

Classification

What do all these problems have in common? Categorical outputs, called labels (e.g., yes/no, dog/cat/person/other).
Assigning each input vector to one of a finite number of labels is called classification.
Binary classification: two possible labels (e.g., yes/no, 0/1, cat/dog).
Multi-class classification: multiple possible labels.
We will first look at binary problems, and discuss multi-class problems later in class.

Today

Linear classification (binary). Key concepts:
Classification as regression
Decision boundary
Loss functions
Metrics to evaluate classification

Classification vs Regression

We are interested in mapping the input $x \in \mathcal{X}$ to a label $t \in \mathcal{Y}$.
In regression, typically $\mathcal{Y} = \mathbb{R}$. Now $\mathcal{Y}$ is categorical.

Classification as Regression

Can we do this task using what we have learned in previous lectures?
Simple hack: ignore that the output is categorical!
Suppose we have a binary problem, $t \in \{-1, 1\}$.
Assume the standard model used for (linear) regression: $y(x) = f(x, w) = w^T x$.
How can we obtain $w$? Use least squares: $w = (X^T X)^{-1} X^T t$.
How is $X$ computed? And $t$? Which loss are we minimizing? Does it make sense?
$\ell_{\text{square}}(w, t) = \frac{1}{N} \sum_{n=1}^{N} \left(t^{(n)} - w^T x^{(n)}\right)^2$
How do I compute a label for a new example? Let's see an example.
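
As a rough illustration, here is a minimal sketch of this least-squares "hack" on a toy 1D problem; the data, the bias column, and the use of numpy.linalg.lstsq are my own choices for the example, not from the lecture.

```python
# Classification as regression: fit least squares to +/-1 targets,
# then threshold the regression output to get a label.
import numpy as np

# Toy 1D binary problem with targets in {-1, +1} (illustrative values)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([-1, -1, -1, 1, 1, 1])

# Design matrix with a bias column, so w = (w_0, w_1)
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution w = (X^T X)^{-1} X^T t (computed via lstsq for stability)
w, *_ = np.linalg.lstsq(X, t, rcond=None)

# Label a new example by thresholding the regression output at zero
x_new = 0.3
y_new = np.sign(w[0] + w[1] * x_new)
print(w, y_new)
```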

Classification as Regression

A 1D example (the input x is 1-dimensional). The colors indicate labels: a blue plus denotes that $t^{(i)}$ is from the first class, a red circle that $t^{(i)}$ is from the second class.
[Figure from Greg Shakhnarovich (TTIC)]

Decision Rules

Our classifier has the form $f(x, w) = w_0 + w^T x$.
A reasonable decision rule is
$y = \begin{cases} 1 & \text{if } f(x, w) \geq 0 \\ -1 & \text{otherwise} \end{cases}$
How can I mathematically write this rule? What does this function look like?
$y(x) = \text{sign}(w_0 + w^T x)$

Decision Rules

A 1D example: the prediction jumps from $\hat{y} = -1$ to $\hat{y} = +1$ at the point where $w_0 + w^T x$ crosses zero.
[Figure from Greg Shakhnarovich (TTIC)]
How can I mathematically write this rule? $y(x) = \text{sign}(w_0 + w^T x)$
This specifies a linear classifier: it has a linear boundary (hyperplane) $w_0 + w^T x = 0$, which separates the space into two half-spaces.
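
A minimal sketch of this decision rule in code, assuming 2D inputs; the weights and points are made-up values for illustration.

```python
# Linear decision rule y(x) = sign(w_0 + w^T x): each point gets the label
# of the half-space it falls in.
import numpy as np

def predict(X, w, w0):
    """Return +1/-1 labels for each row of X using the linear decision rule."""
    return np.where(X @ w + w0 >= 0, 1, -1)

w = np.array([2.0, -1.0])   # normal direction of the decision boundary
w0 = 0.5                    # offset: shifts the boundary away from the origin

X = np.array([[1.0, 1.0],
              [-1.0, 2.0],
              [0.0, 0.0]])
print(predict(X, w, w0))
```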

Example in 1D

The linear classifier has a linear boundary (hyperplane) $w_0 + w^T x = 0$, which separates the space into two half-spaces.
In 1D this is simply a threshold.

Example in 2D

The linear classifier has a linear boundary (hyperplane) $w_0 + w^T x = 0$, which separates the space into two half-spaces.
In 2D this is a line.

Example in 3D

The linear classifier has a linear boundary (hyperplane) $w_0 + w^T x = 0$, which separates the space into two half-spaces.
In 3D this is a plane. What about higher-dimensional spaces?

Geometry

$w^T x = 0$ is a line passing through the origin and orthogonal to $w$.
$w^T x + w_0 = 0$ shifts it by $w_0$.
[Figure from G. Shakhnarovich]
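
A small numerical check of this geometry, using made-up values for $w$ and $w_0$: $w$ is orthogonal to any direction lying in the boundary, and the boundary sits at signed distance $-w_0 / \|w\|$ from the origin along $w$.

```python
import numpy as np

w = np.array([3.0, 4.0])
w0 = -5.0

# Two points satisfying w^T x + w0 = 0 (chosen by hand)
x1 = np.array([3.0, -1.0])
x2 = np.array([-1.0, 2.0])

print(np.dot(w, x1) + w0, np.dot(w, x2) + w0)  # both 0: on the boundary
print(np.dot(w, x2 - x1))                       # 0: w is orthogonal to the boundary
print(-w0 / np.linalg.norm(w))                  # distance of the boundary from the origin
```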

Learning Linear Classifiers

Learning consists of estimating a good decision boundary.
We need to find $w$ (direction) and $w_0$ (location) of the boundary.
What does good mean? Is this boundary good?
We need a criterion that tells us how to select the parameters. Do you know any?

Loss functions

Classifying using a linear decision boundary reduces the data dimension to 1: $y(x) = \text{sign}(w_0 + w^T x)$.
What is the cost of being wrong?
Loss function: $L(y, t)$ is the loss incurred for predicting $y$ when the correct answer is $t$.
For medical diagnosis: for a diabetes screening test, is it better to have false positives or false negatives?
For movie ratings: the truth is that Alice thinks E.T. is worthy of a 4. How bad is it to predict a 5? How about a 2?

Loss functions

A possible loss to minimize is the zero/one loss:
$L(y(x), t) = \begin{cases} 0 & \text{if } y(x) = t \\ 1 & \text{if } y(x) \neq t \end{cases}$
Is this minimization easy to do? Why?

Other Loss functions

Zero/one loss for a classifier:
$L_{0\text{-}1}(y(x), t) = \begin{cases} 0 & \text{if } y(x) = t \\ 1 & \text{if } y(x) \neq t \end{cases}$
Asymmetric binary loss:
$L_{\text{ABL}}(y(x), t) = \begin{cases} \alpha & \text{if } y(x) = 1 \wedge t = 0 \\ \beta & \text{if } y(x) = 0 \wedge t = 1 \\ 0 & \text{if } y(x) = t \end{cases}$
Squared (quadratic) loss: $L_{\text{squared}}(y(x), t) = (t - y(x))^2$
Absolute error: $L_{\text{absolute}}(y(x), t) = |t - y(x)|$
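
A minimal sketch of these losses in code; it assumes 0/1 labels for the asymmetric binary loss, and the particular $\alpha$, $\beta$ values and toy arrays are illustrative choices, not from the lecture.

```python
import numpy as np

def zero_one_loss(y, t):
    return np.where(y == t, 0.0, 1.0)

def asymmetric_binary_loss(y, t, alpha=1.0, beta=5.0):
    # alpha penalizes false positives (y=1, t=0); beta penalizes false negatives (y=0, t=1)
    return np.where((y == 1) & (t == 0), alpha,
           np.where((y == 0) & (t == 1), beta, 0.0))

def squared_loss(y, t):
    return (t - y) ** 2

def absolute_loss(y, t):
    return np.abs(t - y)

y = np.array([1, 0, 1, 1])
t = np.array([1, 1, 0, 1])
print(zero_one_loss(y, t), asymmetric_binary_loss(y, t))
```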

More Complex Loss Functions

What if the movie predictions are used for rankings? Now the predicted ratings don't matter, just the order that they imply.
In what order does Alice prefer E.T., Amelie and Titanic?
Possibilities: 0-1 loss on the winner; permutation distance; accuracy of the top K movies.

Can we always separate the classes?

If we can separate the classes with a linear decision boundary, the problem is linearly separable.

Can we always separate the classes?

Causes of non-perfect separation:
Model is too simple
Noise in the inputs (i.e., data attributes)
Simple features that do not account for all variations
Errors in data targets (mis-labelings)
Should we make the model complex enough to have perfect separation in the training data?

Metrics

How do we evaluate how good our classifier is? How is it doing on dog vs. no-dog?

Metrics

How do we evaluate how good our classifier is?
Recall: the fraction of relevant instances that are retrieved
$R = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground-truth instances}}$
Precision: the fraction of retrieved instances that are relevant
$P = \frac{TP}{TP + FP} = \frac{TP}{\text{all predicted instances}}$
F1 score: the harmonic mean of precision and recall
$F1 = \frac{2 P R}{P + R}$
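
A minimal sketch of these three metrics computed from binary predictions; the small 0/1 arrays are illustrative data.

```python
import numpy as np

def precision_recall_f1(y_pred, t):
    tp = np.sum((y_pred == 1) & (t == 1))   # true positives
    fp = np.sum((y_pred == 1) & (t == 0))   # false positives
    fn = np.sum((y_pred == 0) & (t == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_pred = np.array([1, 1, 0, 1, 0, 0])
t      = np.array([1, 0, 0, 1, 1, 0])
print(precision_recall_f1(y_pred, t))   # (0.666..., 0.666..., 0.666...)
```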

More on Metrics

How do we evaluate how good our classifier is?
Precision: the fraction of retrieved instances that are relevant
Recall: the fraction of relevant instances that are retrieved
Precision-recall curve: precision plotted against recall as the decision threshold varies
Average Precision (AP): the area under the precision-recall curve
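
A minimal sketch of a precision-recall curve and AP, assuming real-valued classifier scores and 0/1 ground-truth labels; the scores, labels, and the particular step-wise AP estimate (precision weighted by the increase in recall) are my own illustrative choices.

```python
import numpy as np

def precision_recall_curve(scores, t):
    order = np.argsort(-scores)        # sweep the decision threshold from high to low
    t_sorted = t[order]
    tp = np.cumsum(t_sorted == 1)
    fp = np.cumsum(t_sorted == 0)
    precision = tp / (tp + fp)
    recall = tp / np.sum(t == 1)
    return precision, recall

def average_precision(precision, recall):
    # Sum precision weighted by the increase in recall at each threshold
    recall_prev = np.concatenate([[0.0], recall[:-1]])
    return np.sum(precision * (recall - recall_prev))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
t      = np.array([1,   1,   0,   1,   0,    0])
p, r = precision_recall_curve(scores, t)
print(average_precision(p, r))
```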

Metrics vs Loss

Metrics on a dataset are what we care about (performance).
We typically cannot directly optimize for the metrics.
Our loss function should reflect the problem we are solving; we then hope it will yield models that do well on our dataset.