LEARNING & LINEAR CLASSIFIERS


LEARNING & LINEAR CLASSIFIERS (1/26)
J. Matas
Czech Technical University, Faculty of Electrical Engineering
Department of Cybernetics, Center for Machine Perception
121 35 Praha 2, Karlovo nám. 13, Czech Republic
matas@fel.cvut.cz, http://cmp.felk.cvut.cz

LECTURE PLAN
The problem of classifier design.
Learning in pattern recognition.
Linear classifiers.
Perceptron algorithms.
Optimal separating plane with the Kozinec algorithm.

CLASSIFIER DESIGN (1) (2/26)

The object of interest is characterised by observable properties x ∈ X and its class membership (unobservable, hidden state) k ∈ K, where X is the space of observations and K the set of hidden states.

The objective of classifier design is to find a function q : X → K that has some optimal properties.

Bayesian decision theory solves the problem of minimisation of the risk

    R(q) = \sum_{x, k} W(q(x), k)\, p(x, k)

given the following quantities:
p(x, k), x ∈ X, k ∈ K ... the statistical model of the dependence of the observable properties (measurements) on class membership,
W(q(x), k) ... the loss of decision q(x) if the true class is k.

Do you know the solution for the 0-1 loss function?
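
For reference, the answer to the question above is the standard Bayes result (not spelled out on this slide): for the 0-1 loss, W(q(x), k) = 0 if q(x) = k and 1 otherwise, the risk is minimised by the maximum a posteriori decision

    q^*(x) = \operatorname*{argmax}_{k \in K} p(k \mid x) = \operatorname*{argmax}_{k \in K} p(x, k).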

CLASSIFIER DESIGN (2) (3/26)

Non-Bayesian decision theory solves the problem if p(x | k), x ∈ X, k ∈ K are known but the priors p(k) are unknown (or do not exist). Constraints or preferences for different errors depend on the problem formulation.

Do you know any non-Bayesian problem formulations?

However, in applications typically:
none of the probabilities are known!
the designer is only given a training multiset T = {(x_1, k_1), ..., (x_L, k_L)}, where L is the length (size) of the training multiset,
the desired properties of the classifier q(x) are known.

How would you proceed?

CLASSIFIER DESIGN via PARAMETER ESTIMATION (4/26)

Assume p(x, k) have a particular form, e.g. Gaussian (mixture), piece-wise constant, etc., with a finite (i.e. small) number of parameters Θ_k.

Estimate the parameters from the training set T.

Solve the classifier design problem (e.g. risk minimisation), substituting the estimated p̂(x, k) for the true (and unknown) probabilities p(x, k).

? : What estimation principle should be used? What estimation paradigms do you know?
− : There is no direct relationship between known properties of the estimated p̂(x, k) and the properties (typically the risk) of the obtained classifier q(x).
− : If the true p(x, k) is not of the assumed form, q(x) may be arbitrarily bad, even if the size L of the training set approaches infinity!
+ : Implementation is often straightforward, especially if the parameters Θ_k for each class are assumed independent.
+ : Performance on unseen data can be predicted by cross-validation.

LEARNING in STATISTICAL PATTERN RECOGNITION (5/26)

Choose a class Q of decision functions (classifiers) q : X → K.

Find q^* ∈ Q minimising some criterion function on the training set that approximates the risk R(q) (which cannot be computed).

The learning paradigm is defined by the criterion function:

Empirical risk (training set error) minimisation. The true risk is approximated by

    R_{emp}(q_\Theta(x)) = \frac{1}{L} \sum_{i=1}^{L} W(q_\Theta(x_i), k_i), \qquad \Theta^* = \operatorname*{argmin}_{\Theta} R_{emp}(q_\Theta(x)).

Examples: Perceptron, neural nets (back-propagation), etc.

Structural risk minimisation. Example: SVM (Support Vector Machines).
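
As an illustration of the empirical risk formula above, a minimal Python sketch; the classifier, data and names here are invented for illustration and are not part of the lecture.

    import numpy as np

    def empirical_risk(q, X, k, W=None):
        """Average loss of classifier q over the training multiset (x_i, k_i).

        q : function mapping an observation to a decision
        W : loss W(decision, true class); defaults to the 0/1 loss
        """
        if W is None:
            W = lambda d, t: 0.0 if d == t else 1.0   # 0/1 loss
        return sum(W(q(x), t) for x, t in zip(X, k)) / len(X)

    # toy usage with a hypothetical linear dichotomy q(x) = sign(<w, x> + b)
    w, b = np.array([1.0, -1.0]), 0.0
    q = lambda x: 1 if w @ x + b >= 0 else -1
    X = [np.array([2.0, 1.0]), np.array([0.0, 3.0])]
    k = [1, -1]
    print(empirical_risk(q, X, k))   # -> 0.0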

OVERFITTING AND UNDERFITTING (6/26)

How rich a class Q of classifiers q_Θ(x) should be used?

The problem of generalisation is a key problem of pattern recognition: a small empirical risk R_emp need not imply a small true expected risk R!

[Figure: three fits of the same data, labelled underfit, fit, overfit]

STRUCTURAL RISK MINIMIZATION PRINCIPLE (1) (7/26)

We would like to minimise the risk

    R(q) = \sum_{x, k} W(q_\Theta(x), k)\, p(x, k),

but p(x, k) is unknown. Vapnik and Chervonenkis proved a remarkable inequality

    R(q) \le R_{emp}(q) + R_{str}\!\left(h, \frac{1}{L}\right),

where h is the VC dimension (capacity) of the class of strategies Q.

Notes:
+ R_str does not depend on the unknown p(x, k)!!
+ R_str is known for some classes Q, e.g. linear classifiers.

STRUCTURAL RISK MINIMIZATION PRINCIPLE (2) (8/26)

There are more types of upper bounds on R. E.g. for linear discriminant functions applied to data inside a sphere of radius R and separated with margin m, the VC dimension (capacity) satisfies

    h \le \frac{R^2}{m^2} + 1.

Examples of learning algorithms maximising the margin: SVM or the ε-Kozinec algorithm,

    (w^*, b^*) = \operatorname*{argmax}_{w, b} \min\left( \min_{x \in X_1} \frac{\langle w, x \rangle + b}{\|w\|},\; \min_{x \in X_2} \frac{-\langle w, x \rangle - b}{\|w\|} \right).

EMPIRICAL RISK MINIMISATION REVISITED (9/26)

Is empirical risk minimisation (minimisation of training set error, e.g. neural networks with backpropagation) therefore dead? No!

R_str may be so large that the upper bound is useless. Find a tighter bound and you will be famous! It is not impossible!
+ Vapnik's theory justifies using empirical risk minimisation on classes of functions with finite VC dimension.
+ Vapnik suggests learning with progressively more complex classes Q.
Empirical risk minimisation is computationally hard (impossible for large L). Most classes of decision functions Q where empirical risk minimisation (at least locally) can be efficiently organised are often useful.

Where does the nearest neighbour classifier fit in the picture?

WHY ARE LINEAR CLASSIFIERS IMPORTANT? (10/26)

For some statistical models, the Bayesian or non-Bayesian strategy is implemented by a linear discriminant function. You should know an example!

The capacity (VC dimension) of linear strategies in an n-dimensional space is n + 2. Thus the learning task is well-posed, i.e. a strategy tuned on a finite training multiset does not differ much from the correct strategy found for the statistical model.

There are efficient learning algorithms for linear classifiers.

Some non-linear discriminant functions can be implemented as linear ones after a feature space transformation.
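
As an illustration of the last point (an example of our own, not taken from the slide): a quadratic discriminant function of a scalar x is linear in the lifted feature vector (x^2, x, 1),

    f(x) = w_2 x^2 + w_1 x + b = \langle (w_2, w_1, b), (x^2, x, 1) \rangle,

so a classifier that is non-linear in x becomes a linear classifier in the transformed feature space.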

LINEAR DISCRIMINANT FUNCTION q(x) (11/26)

    f_j(x) = \langle w_j, x \rangle + b_j,

where ⟨·,·⟩ denotes a scalar product. The strategy

    q(x) = \operatorname*{argmax}_{j} f_j(x)

divides X into |K| convex regions.

[Figure: a partition of the plane into convex regions labelled k = 1, ..., 6]
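
A minimal sketch of the strategy argmax_j f_j(x) in code; the weight vectors and offsets below are arbitrary placeholders for illustration, not values from the figure.

    import numpy as np

    # one row per class: w_j as rows of W, offsets b_j in b
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    b = np.array([0.0, 0.5, -0.2])

    def q(x):
        """Return the index (1-based, as in the figure) of the maximal discriminant f_j(x)."""
        return int(np.argmax(W @ x + b)) + 1

    print(q(np.array([2.0, 0.1])))   # -> 1 for this toy choice of W, b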

DICHOTOMY, TWO CLASSES ONLY (12/26)

|K| = 2, i.e. two hidden states (typically also classes):

    q(x) = \begin{cases} k = 1, & \text{if } \langle w, x \rangle + b \ge 0, \\ k = -1, & \text{if } \langle w, x \rangle + b < 0. \end{cases}

[Figure: a separating line in the (x_1, x_2) plane]

PERCEPTRON LEARNING: Formulation (13/26)

Input: T = {(x_1, k_1), ..., (x_L, k_L)}, k ∈ {−1, 1}
Output: a weight vector w and offset b satisfying, for all j ∈ {1..L}:

    \langle w, x_j \rangle + b \ge 0 \quad \text{if } k_j = 1,
    \langle w, x_j \rangle + b < 0 \quad \text{if } k_j = -1;

equivalently, multiplying both inequalities by k_j,

    \langle w, k_j x_j \rangle + k_j b \ge 0;

or, even simpler,

    \langle w', k_j x'_j \rangle \ge 0, \quad \text{where } x'_j = [x_j\ 1],\ w' = [w\ b].

PERCEPTRON LEARNING: Simplified Formulation (14/26)

To simplify notation, we reformulate the problem.

Input: T' = {x'_1, ..., x'_L}, where x'_j = k_j [x_j\ 1]
Output: a weight vector w' = [w\ b] such that

    \langle w', x'_j \rangle \ge 0, \quad j ∈ {1..L}.

We drop the primes and go back to the ⟨w, x⟩ notation. The vector w has the offset b as its last element; x has an extra 1 appended and has been multiplied by k.

PERCEPTRON LEARNING: THE ALGORITHM (15/26)

Input: T = {x_1, ..., x_L}
Output: a weight vector w

Perceptron algorithm (Rosenblatt 1962):
1. w_1 = 0.
2. A wrongly classified observation x_j is sought, i.e. ⟨w_t, x_j⟩ < 0, j ∈ {1..L}.
3. If there is no misclassified observation, the algorithm terminates; otherwise w_{t+1} = w_t + x_j.
4. Go to 2.
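
A minimal runnable sketch of the algorithm above in the simplified notation of the previous slide (observations already multiplied by their labels and augmented with a trailing 1). The toy data are invented for illustration, and misclassification is tested with ≤ 0 rather than < 0 so the iteration does not stall at the initial w = 0.

    import numpy as np

    def perceptron(X, max_iter=1000):
        """Perceptron on pre-transformed data x'_j = k_j * [x_j, 1]."""
        w = np.zeros(X.shape[1])                  # step 1: w_1 = 0
        for _ in range(max_iter):
            wrong = [x for x in X if w @ x <= 0]  # step 2: seek a misclassified x_j
            if not wrong:                         # step 3: none left, terminate
                return w
            w = w + wrong[0]                      # step 3: w_{t+1} = w_t + x_j
        return w                                  # may fail if data are not separable

    # toy usage: two observations of class +1, one of class -1, in 2-D
    data  = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5]])
    label = np.array([1, 1, -1])
    Xp = label[:, None] * np.hstack([data, np.ones((3, 1))])  # x'_j = k_j [x_j 1]
    w = perceptron(Xp)
    print(w, bool((Xp @ w >= 0).all()))   # w' = [w b] and a separation check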

PERCEPTRON: WEIGHT UPDATE (16/26)

[Figure: geometric illustration of the update w_{t+1} = w_t + x_j]

NOVIKOFF THEOREM (17/26)

If the data are linearly separable, then there exists a number t^* \le D^2 / m^2 such that the vector w_{t^*} satisfies the inequality ⟨w_{t^*}, x_j⟩ > 0, j ∈ {1..L}.

? What if the data are not separable?
? How to terminate perceptron learning?

[Figure: illustration of the quantities m and D]
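
The quantities D and m are defined in the figure, which did not survive transcription; in the standard statement of Novikoff's theorem they are the radius of the data and the separation margin,

    D = \max_j \|x_j\|, \qquad m = \max_{w} \min_j \frac{\langle w, x_j \rangle}{\|w\|},

so the number of corrections is bounded by D^2 / m^2 regardless of the order in which misclassified observations are picked.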

PERCEPTRON LEARNING: Non-separable case (18/26)

Perceptron algorithm, batch version, handling non-separability:

Input: T = {x_1, ..., x_L}
Output: a weight vector w^*

1. w_1 = 0, E = |T| = L, w^* = 0.
2. Find all misclassified observations X^- = {x ∈ X : ⟨w_t, x⟩ < 0}.
3. If |X^-| < E then E = |X^-|; w^* = w_t.
4. If tc(w^*, t, t_lu) then terminate, else w_{t+1} = w_t + η_t Σ_{x ∈ X^-} x.
5. Go to 2.

The algorithm converges with probability 1 to the optimal solution. The convergence rate is not known (to me). The termination condition tc(·) is a complex function of the quality of the best solution so far, the time since the last update t − t_lu, and the requirements on the solution.
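
A minimal sketch of this batch version with the best-solution bookkeeping. The termination condition tc(·) is simplified here to a fixed iteration budget, and misclassification is again tested with ≤ 0; both are assumptions of the sketch, not part of the slide.

    import numpy as np

    def batch_perceptron(X, eta=1.0, max_iter=1000):
        """Batch perceptron on x'_j = k_j * [x_j, 1], keeping the best w seen so far."""
        w = np.zeros(X.shape[1])
        best_w, best_err = w.copy(), len(X)       # step 1
        for t in range(max_iter):
            wrong = X[X @ w <= 0]                 # step 2: all misclassified observations
            if len(wrong) < best_err:             # step 3: remember the best solution
                best_err, best_w = len(wrong), w.copy()
            if best_err == 0:                     # step 4: simplified termination
                break
            w = w + eta * wrong.sum(axis=0)       # step 4: batch update
        return best_w, best_err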

PERCEPTRON LEARNING as an Optimisation problem (1) (19/26)

Perceptron algorithm, batch version, handling non-separability, another perspective:

Input: T = {x_1, ..., x_L}
Output: a weight vector w minimising

    J(w) = \left| \{ x ∈ X : \langle w, x \rangle < 0 \} \right|,

i.e. the number of misclassified observations.

How would the most common optimisation method, i.e. gradient descent,

    w_{t+1} = w_t - \eta\, \nabla J(w),

perform? The gradient of J(w) is either 0 or undefined, so gradient minimisation cannot proceed.

PERCEPTRON LEARNING as an Optimisation problem (2) (20/26)

Let us redefine the cost function:

    J_p(w) = \sum_{x ∈ X : \langle w, x \rangle < 0} -\langle w, x \rangle,
    \nabla J_p(w) = \frac{\partial J_p}{\partial w} = \sum_{x ∈ X : \langle w, x \rangle < 0} -x.

The Perceptron Algorithm is a gradient descent method for J_p(w)!

Learning by empirical risk minimisation is just an instance of an optimisation problem. Either gradient minimisation (backpropagation in neural networks) or convex (quadratic) minimisation (in the mathematical literature called convex programming) is used.
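
Spelling out the claim above with the signs as reconstructed here: one gradient descent step on J_p recovers exactly the batch perceptron update of slide 18,

    w_{t+1} = w_t - \eta\, \nabla J_p(w_t) = w_t + \eta \sum_{x ∈ X : \langle w_t, x \rangle < 0} x.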

OPTIMAL SEPARATING PLANE and THE CLOSEST POINT TO THE CONVEX HULL (21/26)

The problem of optimal separation by a hyperplane,

    (1)  w^* = \operatorname*{argmax}_{w} \min_j \frac{\langle w, x_j \rangle}{\|w\|},

can be converted to the search for the point of the convex hull (denoted by the overline) closest to the origin,

    x^* = \operatorname*{argmin}_{x ∈ \overline{X}} \|x\|.

It holds that x^* also solves problem (1).

Recall that the classifier that maximises the separation minimises the structural risk R_str (slide 8)!

CONVEX HULL, ILLUSTRATION (22/26)

[Figure: the convex hull X̄ of the data and its closest point w^* to the origin, with ‖w^*‖ = m; the quantities min_j ⟨w, x_j⟩/‖w‖ and ‖w‖ = √⟨w, w⟩ serve as a lower and an upper bound on m]

ε-SOLUTION (23/26)

The aim is to speed up the algorithm. An allowed uncertainty ε is introduced:

    \|w\| - \min_j \frac{\langle w, x_j \rangle}{\|w\|} \le \varepsilon.
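
Why this yields an ε-solution (assuming, as in the previous illustration, that w lies in the convex hull of the data, which holds for the Kozinec iterates since they are convex combinations of observations): the two monitored quantities sandwich the optimal margin m,

    \min_j \frac{\langle w, x_j \rangle}{\|w\|} \;\le\; m \;\le\; \|w\|,

so a gap of at most ε between them guarantees that the margin achieved by w is within ε of the optimum.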

TRAINING ALGORITHM 2: KOZINEC (1973) (24/26)

1. w_1 = x_j, i.e. any observation.
2. A wrongly classified observation x_t is sought, i.e. ⟨w_t, x_j⟩ < b, j ∈ J.
3. If there is no wrongly classified observation, the algorithm finishes; otherwise

       w_{t+1} = (1 - k)\, w_t + k\, x_t, \quad k ∈ R, \quad \text{where } k = \operatorname*{argmin}_{k} \|(1 - k)\, w_t + k\, x_t\|.

4. Go to 2.
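
A minimal runnable sketch of the Kozinec iteration above on the transformed data of slide 14. The threshold b is taken as 0 and the misclassification test uses ≤, both assumptions of this sketch rather than statements of the slide; the minimiser over k has the closed form used below and falls in (0, 1] for a misclassified x_t, so the update stays on the segment between w_t and x_t.

    import numpy as np

    def kozinec(X, max_iter=1000):
        """Kozinec algorithm on pre-transformed data x'_j = k_j * [x_j, 1]."""
        w = X[0].astype(float)                    # step 1: start from any observation
        for _ in range(max_iter):
            wrong = [x for x in X if w @ x <= 0]  # step 2: seek a misclassified x_t
            if not wrong:                         # step 3: none left, finish
                return w
            x_t = wrong[0]
            d = x_t - w
            k = -(w @ d) / (d @ d)                # argmin_k ||(1 - k) w + k x_t||
            w = (1 - k) * w + k * x_t             # step 3: Kozinec update
        return w

    # toy usage on the same transformed data as in the perceptron sketch above
    data  = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5]])
    label = np.array([1, 1, -1])
    Xp = label[:, None] * np.hstack([data, np.ones((3, 1))])
    print(kozinec(Xp))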

KOZINEC, PICTORIAL ILLUSTRATION (25/26)

[Figure: the Kozinec update; w_{t+1} is the point on the line through w_t and the misclassified x_t that is closest to the origin 0; the hyperplane ⟨w_t, x⟩ = 0 is shown]

KOZINEC and ε-SOLUTION (26/26)

The second step of the Kozinec algorithm is modified to:

A wrongly classified observation x_t is sought, i.e. an x_t with

    \|w_t\| - \frac{\langle w_t, x_t \rangle}{\|w_t\|} \ge \varepsilon,

so the algorithm terminates once \|w_t\| - \min_j \langle w_t, x_j \rangle / \|w_t\| < \varepsilon.

[Figure: the upper bound ‖w_t‖ and the lower bound approaching the optimal margin m as t grows; the iteration stops when their gap drops below ε]