
Classification and Pattern Recognition
Léon Bottou, NEC Labs America
COS 424, 2/23/2010

The machine learning mix and match
- Goals: classification, clustering, regression, other.
- Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
- Capacity control: explicit (architecture, feature selection; regularization, priors) and implicit (approximate optimization; Bayesian averaging, ensembles).
- Operational considerations: loss functions; budget constraints; online vs. offline.
- Computational considerations: exact algorithms for small datasets; stochastic algorithms for big datasets; parallel algorithms.

Topics for today's lecture
(The same mix-and-match grid as the previous slide.)

Summary
1. Bayesian decision theory
2. Nearest neighbours
3. Parametric classifiers
4. Surrogate loss functions
5. ROC curve
6. Multiclass and multilabel problems

Classification, a.k.a. pattern recognition
Association between patterns x ∈ X and classes y ∈ Y.
The pattern space X is unspecified; for instance, X = R^d.
The class space Y is an unordered finite set.
Examples:
- Binary classification (Y = {±1}): fraud detection, anomaly detection, ...
- Multiclass classification (Y = {C_1, C_2, ..., C_M}): object recognition, speaker identification, face recognition, ...
- Multilabel classification (Y is a power set): document topic recognition, ...
- Sequence recognition (Y contains sequences): speech recognition, signal identification, ...

Probabilistic model
Patterns and classes are represented by random variables X and Y:
P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y)

Bayes decision theory
Consider a classifier x ∈ X ↦ f(x) ∈ Y. Maximize the probability of a correct answer:
P{f(X) = Y} = ∫ 1I(f(x) = y) dP(x, y)
            = ∫ ∑_{y∈Y} 1I(f(x) = y) P{Y = y | X = x} dP(x)
            = ∫ P{Y = f(x) | X = x} dP(x)
Bayes optimal decision rule: f*(x) = arg max_{y∈Y} P{Y = y | X = x}.
Bayes optimal error rate: B = ∫ ( 1 − max_{y∈Y} P{Y = y | X = x} ) dP(x).

Bayes optimal decision rule
[Figure] Comparing the class densities p_y(x) scaled by the class priors P_y = P{Y = y}; the hatched area represents the Bayes optimal error rate.

How to build a classifier from data
Given a finite set of training examples {(x_1, y_1), ..., (x_n, y_n)}:
- Estimating probabilities: find a plausible probability distribution (next lecture); compute or approximate the optimal Bayes classifier.
- Minimizing the empirical error: choose a parametrized family of classification functions a priori; pick the one that minimizes the observed error rate.
- Nearest neighbours: determine the class of x on the basis of the closest training example(s).

Nearest neighbours
Let d(x, x′) be a distance on the patterns.
Nearest neighbour rule (1NN): give x the class of the closest training example,
f_nn(x) = y_{nn(x)}  with  nn(x) = arg min_i d(x, x_i).
K-nearest neighbours rule (kNN): give x the most frequent class among the K closest training examples.
kNN variants: weighted votes (according to the distances).
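The kNN rule is simple enough to state directly in code. Below is a minimal NumPy sketch (ours, not from the slides); knn_predict and the toy data are illustrative, and ties are broken arbitrarily.

    import numpy as np

    def knn_predict(X_train, y_train, x, k=1):
        """Classify a single pattern x by majority vote among its k nearest
        training examples, using the squared Euclidean distance."""
        dists = np.sum((X_train - x) ** 2, axis=1)      # distances to all training examples
        nearest = np.argsort(dists)[:k]                 # indices of the k closest ones
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                # most frequent class (ties arbitrary)

    # Toy usage: two well-separated clusters labelled -1 and +1.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([-1, -1, +1, +1])
    print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1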

Voronoi tessellation
[Figures: Euclidean distance in the plane; cosine distance on the sphere.]
1NN: a piecewise constant classifier defined on the Voronoi cells.
kNN: the same, but with smaller cells and additional constraints.

1NN and the optimal Bayes error
Theorem (Cover & Hart, 1967). Assume η_y(x) = P{Y = y | X = x} is continuous. Then, as n → ∞,
B ≤ P{f_nn(X) ≠ Y} ≤ 2B.
Easy proof when there are only two classes. Let η(x) = P{Y = +1 | X = x}, so that
B = ∫ min(η(x), 1 − η(x)) dP(x).
Writing x′ for the nearest neighbour of x,
P{f_nn(X) ≠ Y} = ∫ [ η(x)(1 − η(x′)) + (1 − η(x))η(x′) ] dP(x).
As n → ∞, x′ → x, so this tends to ∫ 2η(x)(1 − η(x)) dP(x) ≤ 2 ∫ min(η(x), 1 − η(x)) dP(x) = 2B.

1NN versus kNN
Using more neighbours:
- gets closer to the Bayes rule in the limit;
- needs more examples to approach the condition η(x_knn(x)) ≈ η(x).
[Figure: asymptotic error of 1-NN, 3-NN, 5-NN, 7-NN and 51-NN as a function of η, compared with the Bayes rate and twice the Bayes rate.]
K is a capacity parameter, to be determined using a validation set.

Computation
Straightforward implementation: computing f(x) requires n distance computations.
(−) Grows with the number of examples.
(+) Embarrassingly parallelizable.
Data structures to speed up the search: K-D trees.
(+) Very effective in low dimension.
(−) Nearly useless in high dimension.
Shortcutting the computation of distances:
- Stop computing a distance as soon as it becomes non-competitive.
- Use the triangle inequality d(x, x_i) ≥ d(x, x′) − d(x_i, x′):
  pick r well-spread patterns x^(1), ..., x^(r);
  precompute d(x_i, x^(j)) for i = 1...n and j = 1...r;
  lower bound d(x, x_i) ≥ max_{j=1...r} | d(x, x^(j)) − d(x_i, x^(j)) |;
  shortcut whenever the lower bound is not competitive.
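As a rough illustration of this shortcut, here is a sketch (ours, not from the slides) of a 1NN search that prunes candidates using the triangle-inequality lower bound; nn_with_pruning and X_ref are illustrative names, and in practice the train-to-reference distances would be precomputed once, offline.

    import numpy as np

    def nn_with_pruning(X_train, x, X_ref):
        """1NN search that skips training examples whose triangle-inequality
        lower bound already exceeds the best distance found so far.
        X_ref holds r well-spread reference patterns x^(1) ... x^(r)."""
        # d(x_i, x^(j)) for all i, j -- in a real system, precomputed once offline.
        d_train_ref = np.linalg.norm(X_train[:, None, :] - X_ref[None, :, :], axis=2)
        d_query_ref = np.linalg.norm(X_ref - x, axis=1)          # d(x, x^(j))
        best_i, best_d = -1, np.inf
        for i in range(len(X_train)):
            lower = np.max(np.abs(d_query_ref - d_train_ref[i])) # lower bound on d(x, x_i)
            if lower >= best_d:
                continue                                         # cannot win: skip the full distance
            d = np.linalg.norm(X_train[i] - x)
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d

    rng = np.random.RandomState(0)
    X_train = rng.randn(1000, 10)
    X_ref = X_train[rng.choice(1000, size=8, replace=False)]     # a crude choice of references
    print(nn_with_pruning(X_train, rng.randn(10), X_ref))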

Distances
Nearest neighbour performance is sensitive to the distance.
Euclidean distance: d(x, x′) = ‖x − x′‖² — do not take the square root!
Mahalanobis distance: d(x, x′) = (x − x′)ᵀ A (x − x′) with A = Σ⁻¹.
Safe variant: A = (Σ + εI)⁻¹.
Dimensionality reduction:
- Diagonalize Σ = QᵀΛQ.
- Drop the low eigenvalues and the corresponding eigenvectors.
- Define x̃ = Λ^{−1/2} Q x and precompute all the x̃_i.
- Compute d(x, x_i) = ‖x̃ − x̃_i‖².
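The dimensionality-reduction recipe can be read as a whitening map: after the transformation, the plain squared Euclidean distance equals the regularized (and, if truncated, approximate) Mahalanobis distance. The NumPy sketch below assumes that reading; whiten, eps and n_keep are illustrative names, and NumPy's eigh convention (eigenvectors as columns) is used.

    import numpy as np

    def whiten(X, eps=1e-3, n_keep=None):
        """Map patterns so that squared Euclidean distance in the new space equals
        the regularized Mahalanobis distance with A = (Sigma + eps*I)^{-1}.
        Optionally keep only the n_keep largest-variance directions."""
        Sigma = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])   # safe variant
        eigvals, eigvecs = np.linalg.eigh(Sigma)                     # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]                            # largest first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        if n_keep is not None:                                       # drop the low eigenvalues
            eigvals, eigvecs = eigvals[:n_keep], eigvecs[:, :n_keep]
        return (X @ eigvecs) / np.sqrt(eigvals)                      # x_tilde = Lambda^{-1/2} Q^T x

    X = np.random.RandomState(0).randn(200, 5) * np.array([5.0, 1.0, 1.0, 0.5, 0.1])
    X_tilde = whiten(X, n_keep=3)
    # Squared distances between rows of X_tilde approximate Mahalanobis distances in X.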

Discriminant function
Binary classification: y = ±1.
A discriminant function f_w(x) assigns class sign(f_w(x)) to pattern x.
The symbol w represents the parameters to be learnt.
Example: the linear discriminant function f_w(x) = wᵀΦ(x).

Example: The Perceptron
The perceptron is a linear discriminant function.
[Figure: retina → associative area producing features x → threshold element computing sign(wᵀx).]

The Perceptron algorithm
- Initialize w ← 0.
- Loop: pick an example (x_i, y_i); if y_i wᵀΦ(x_i) ≤ 0 then w ← w + y_i Φ(x_i).
- Until all examples are correctly classified.
Perceptron theorem: the algorithm is guaranteed to stop if the training data is linearly separable.
Perceptron via stochastic gradient descent: SGD for minimizing C(w) = ∑_i max(0, −y_i wᵀΦ(x_i)) gives the same update: if y_i wᵀΦ(x_i) ≤ 0 then w ← w + γ y_i Φ(x_i).
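The loop above translates almost line by line into NumPy. This is a sketch assuming Φ(x) = x with a bias folded into the features; perceptron_train and the toy dataset are illustrative names.

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        """Perceptron algorithm with Phi(x) = x.
        X: (n, d) patterns (include a constant feature for the bias),
        y: (n,) labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
                    w += yi * xi               # perceptron update
                    mistakes += 1
            if mistakes == 0:                  # every example correctly classified
                return w
        return w

    # Linearly separable toy data; the first column is a constant bias feature.
    X = np.array([[1.0, 0.2, 0.1], [1.0, 0.4, 0.3], [1.0, 0.9, 0.8], [1.0, 0.7, 0.9]])
    y = np.array([-1, -1, +1, +1])
    w = perceptron_train(X, y)
    print(np.sign(X @ w))                      # reproduces y once the loop has stopped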

The Perceptron Mark 1 (1957)
The Perceptron is not an algorithm. The Perceptron is a machine!

Minimize the empirical error rate
Empirical error rate: min_w (1/n) ∑_{i=1}^n 1I{ y_i f(x_i, w) ≤ 0 }.
Viewed as a function of the margin y ŷ(x), the misclassification loss is noncontinuous, nondifferentiable and nonconvex.

Surrogate loss function
Minimize instead: min_w (1/n) ∑_{i=1}^n ℓ( y_i f(x_i, w) ).
Quadratic surrogate loss: ℓ(z) = (z − 1)², plotted against the margin z = y ŷ(x).

Surrogate loss functions
Exp loss and log loss:
- Exp loss: ℓ(z) = exp(−z)
- Log loss: ℓ(z) = log(1 + exp(−z))
Hinges:
- Perceptron loss: ℓ(z) = max(0, −z)
- Hinge loss: ℓ(z) = max(0, 1 − z)

Surrogate loss functions (continued)
Quadratic+sigmoid: with σ(z) = tanh(z), ℓ(z) = (σ(3z/2) − 1)².
Ramp loss: ℓ(z) = [1 − z]₊ − [s − z]₊.
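For reference, here are the surrogate losses of the last two slides written as NumPy functions of the margin z = y ŷ(x). This is a sketch: the ramp parameter s = −1 is an arbitrary illustrative choice, and the 3z/2 constant in the quadratic+sigmoid loss follows the reconstruction above.

    import numpy as np

    losses = {
        "misclassification": lambda z: (z <= 0).astype(float),
        "quadratic":         lambda z: (z - 1.0) ** 2,
        "exp":               lambda z: np.exp(-z),
        "log":               lambda z: np.log1p(np.exp(-z)),
        "perceptron":        lambda z: np.maximum(0.0, -z),
        "hinge":             lambda z: np.maximum(0.0, 1.0 - z),
        "quad+sigmoid":      lambda z: (np.tanh(1.5 * z) - 1.0) ** 2,
        "ramp (s=-1)":       lambda z: np.maximum(0.0, 1.0 - z) - np.maximum(0.0, -1.0 - z),
    }

    z = np.linspace(-3.0, 3.0, 7)    # margins y * f(x)
    for name, loss in losses.items():
        print(f"{name:18s}", np.round(loss(z), 2))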

Choice of a surrogate loss function
Constraints from the optimization algorithm:
- A convex loss with a convex f_w(x) ensures the unicity of the minimum.
- Optimization by gradient descent suggests differentiable losses.
- Dual optimization methods work well with hinges.
Class-calibrated loss: in the limit we minimize
∫ [ η(x) ℓ(f_w(x)) + (1 − η(x)) ℓ(−f_w(x)) ] dP(x).
Define L(η, z) = η ℓ(z) + (1 − η) ℓ(−z). If we had an infinite training set and a fully flexible f_w(x), we would have
f(x) = arg min_z L( P{Y = +1 | X = x}, z ).
Examples:
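For instance (a standard computation, not transcribed from the slide), take the log loss ℓ(z) = log(1 + e^(−z)). Then

    L(η, z) = η log(1 + e^(−z)) + (1 − η) log(1 + e^(z))
    ∂L/∂z = ( −η + (1 − η) e^z ) / (1 + e^z) = 0   ⇒   z = log( η / (1 − η) )

so arg min_z L(η, z) = log(η / (1 − η)), which is positive exactly when η > 1/2: minimizing the log loss recovers the sign of the Bayes optimal rule, i.e. the log loss is class calibrated. Likewise, the quadratic loss yields arg min_z L(η, z) = 2η − 1 and the hinge loss yields sign(2η − 1); both are class calibrated as well.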

Asymmetric cost problem
Binary classification: positive class y = +1, negative class y = −1.
Examples of positive classes: fraudulent credit card transaction, relevant document for a given query, heart failure detection.
Different kinds of errors have different costs:
- False positive (false detection, false alarm).
- False negative (non-detection).
Costs are difficult to assess.

Receiver Operating Characteristic (ROC) curve
Changing the threshold b: the assigned class is sign(f(x) − b).
True positives: F₊(b) = P{ f(X) − b > 0 | Y = +1 }.
False positives: F₋(b) = P{ f(X) − b > 0 | Y = −1 }.
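The empirical counterparts of F₊ and F₋ can be computed by sweeping the threshold b over the observed scores. The NumPy sketch below (ours, not from the slides) does exactly that and also estimates the AUC as the fraction of correctly ordered positive-negative score pairs; roc_points and the toy scores are illustrative.

    import numpy as np

    def roc_points(scores, labels):
        """Empirical ROC of the rules sign(f(x) - b): true and false positive
        rates as the threshold b sweeps over the observed scores."""
        scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
        thresholds = np.concatenate((np.sort(np.unique(scores))[::-1], [-np.inf]))
        tpr = np.array([np.mean(scores[labels == +1] > b) for b in thresholds])
        fpr = np.array([np.mean(scores[labels == -1] > b) for b in thresholds])
        return fpr, tpr

    scores = np.array([0.9, 0.8, 0.6, 0.4, 0.35, 0.2])
    labels = np.array([+1, +1, -1, +1, -1, -1])
    fpr, tpr = roc_points(scores, labels)
    # AUC: fraction of (positive, negative) pairs ranked in the right order.
    auc = np.mean(scores[labels == +1][:, None] > scores[labels == -1][None, :])
    print(fpr, tpr, auc)   # auc = 8/9 on this toy data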

Optimal decision rule with asymmetric costs
Let C_y be the cost of erroneously assigning class y to an example. We want to minimize
∫ ∑_{y=±1} C_y 1I(f(x) = y) P{Y ≠ y | X = x} dP(x),
which gives
f(x) = arg min_{y=±1} C_y P{Y ≠ y | X = x} = sign( η(x) − C₊ / (C₊ + C₋) ).
For instance, if a false alarm costs C₊ = 1 and a missed detection costs C₋ = 9, one should predict +1 whenever η(x) > 0.1.
Optimal ROC curve: the optimal decision rules all have the form sign(f(x) − b). Therefore f(x) = η(x) = P{Y = +1 | X = x} gives the optimal ROC curve, and so does any monotone transformation of it.

Empirical ROC
[Figure: an empirical ROC curve estimated from data.]

Ranking
Find a function f_w(x) whose ROC is close to the optimal ROC.
Maximize the Area Under the Curve (AUC). With P the positive and N the negative examples, we would like
min_w ∑_{i∈P} ∑_{j∈N} 1I{ f(x_i, w) ≤ f(x_j, w) },
and with a surrogate,
min_w ∑_{i∈P} ∑_{j∈N} ℓ( f(x_i, w) − f(x_j, w) ).
Ranking the best instances: the AUC often optimizes useless parts of the ROC curve; various algorithms have been proposed to do better...
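A minimal sketch (ours) of the surrogate AUC objective for a linear scorer f(x, w) = wᵀx: a hinge on every positive-negative score difference, minimized by plain subgradient descent. The function names, step size and toy data are illustrative assumptions.

    import numpy as np

    def pairwise_hinge_auc(w, X, y):
        """Surrogate AUC objective: hinge(f(x_i) - f(x_j)) averaged over all
        (positive i, negative j) pairs; returns the loss and a subgradient."""
        Xp, Xn = X[y == +1], X[y == -1]
        margins = (Xp @ w)[:, None] - (Xn @ w)[None, :]     # f(x_i) - f(x_j) for all pairs
        active = (margins < 1.0).astype(float)              # pairs with a nonzero hinge
        loss = np.mean(np.maximum(0.0, 1.0 - margins))
        grad = (-(active.sum(axis=1) @ Xp) + active.sum(axis=0) @ Xn) / margins.size
        return loss, grad

    def auc(w, X, y):
        s = X @ w
        return np.mean(s[y == +1][:, None] > s[y == -1][None, :])

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = np.where(X[:, 0] + 0.5 * rng.randn(200) > 0, 1, -1)
    w = np.zeros(5)
    for _ in range(200):                                     # plain subgradient descent
        _, grad = pairwise_hinge_auc(w, X, y)
        w -= 0.1 * grad
    print(auc(w, X, y))                                      # high AUC on this toy problem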

What to do with more than two classes?
Turn the problem into multiple binary classification problems.
One versus all (M classifiers):
- Classifier f_k(x) detects class k. The recognized class is arg max_k f_k(x).
- Each classifier is trained on the full dataset.
- Dubious principle, but works well in practice.
One versus one (M(M−1)/2 classifiers):
- Classifier f_{k,k′} separates class k from class k′. The recognized class is arg max_k ∑_{k′} f_{k,k′}(x).
- Classifier f_{k,k′} is trained on the examples of classes k and k′ only.
- Dubious principle; often faster but slightly worse.
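A compact sketch of the one-versus-all recipe. The binary learner is deliberately pluggable; a least-squares scorer stands in for it here, and all names (ova_train, ova_predict, least_squares_scorer) are illustrative assumptions.

    import numpy as np

    def ova_train(X, y, classes, train_binary):
        """One-versus-all: train one binary scorer per class, each on the full
        dataset with labels +1 (this class) versus -1 (all the others)."""
        return {k: train_binary(X, np.where(y == k, 1.0, -1.0)) for k in classes}

    def ova_predict(models, x):
        # Recognized class: the one whose detector scores highest.
        return max(models, key=lambda k: float(np.dot(models[k], x)))

    def least_squares_scorer(X, b):
        return np.linalg.lstsq(X, b, rcond=None)[0]          # simple stand-in binary learner

    rng = np.random.RandomState(0)
    centers = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 0.0]), 2: np.array([0.0, 3.0])}
    X = np.vstack([centers[k] + 0.3 * rng.randn(30, 2) for k in centers])
    y = np.repeat([0, 1, 2], 30)
    models = ova_train(X, y, classes=[0, 1, 2], train_binary=least_squares_scorer)
    print(ova_predict(models, np.array([2.8, 0.2])))          # expected: class 1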

What to do with more than two classes? Doing it right!
Learn a function S_w(x, y) that measures how well y goes with x. The recognized class is arg max_y S_w(x, y).
Cost functions:
- Perceptron-like: min_w (1/n) ∑_{i=1}^n [ −S_w(x_i, y_i) + max_y S_w(x_i, y) ]
- Hinge-like: min_w (1/n) ∑_{i=1}^n [ 1 − S_w(x_i, y_i) + max_{y≠y_i} S_w(x_i, y) ]₊
- Logloss-like: min_w (1/n) ∑_{i=1}^n [ −S_w(x_i, y_i) + log ∑_y e^{S_w(x_i, y)} ]
Comments: more costly than OVA, and not better than OVA in practice.
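For the common special case of a linear score S_w(x, y) = w_yᵀx with one weight vector per class, the hinge-like and logloss-like costs above can be written down directly. This sketch and its names are illustrative, not the slide's code.

    import numpy as np

    def multiclass_hinge(W, x, y_true):
        """Hinge-like cost [1 - S(x, y_true) + max_{y != y_true} S(x, y)]_+
        with S(x, y) = W[y] . x."""
        scores = W @ x
        competitors = np.delete(scores, y_true)
        return max(0.0, 1.0 - scores[y_true] + np.max(competitors))

    def multiclass_logloss(W, x, y_true):
        """Logloss-like cost -S(x, y_true) + log sum_y exp S(x, y), computed stably."""
        scores = W @ x
        m = scores.max()
        return -scores[y_true] + m + np.log(np.sum(np.exp(scores - m)))

    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])    # 3 classes, 2 features
    x = np.array([2.0, 0.5])
    print(multiclass_hinge(W, x, y_true=0))                 # 0.0: the margin is comfortable
    print(multiclass_logloss(W, x, y_true=0))               # small but positive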

Multilabel problems
Documents can treat multiple topics, so y is a subset of the set of topics.
Simple approach: one binary classifier per topic. But labels are not independent: taxonomies, related topics.
Complex scoring functions:
- f_k(x) gives a score for document x and topic k.
- R_w(y) measures the compatibility of the topic set y.
- Recognized topics: arg max_{y = {y_1, ..., y_k}} [ R_w({y_1, ..., y_k}) + ∑_k f_{y_k}(x) ].
Same loss functions as for the multiclass problem.