Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers


Engineering Part IIB: Module 4F10 Statistical Pattern Processing
Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Phil Woodland: pcw@eng.cam.ac.uk
Michaelmas 2012

Introduction

Previously, generative models with Gaussian and GMM class-conditional PDFs were discussed. For Gaussian PDFs with common covariance matrices the decision boundary is linear, and this linear decision boundary from the generative model was briefly compared with a discriminative model, logistic regression.

An alternative to modelling the class-conditional PDFs is to select a functional form for a discriminant function and construct a mapping from the observation to the class directly. Here we concentrate on the construction of linear classifiers, although it is also possible to construct quadratic or other non-linear decision boundaries (this is how some types of neural network operate).

There are several methods of classifier construction for a linear discriminant function. The following schemes will be examined:

- iterative solution for the weights via the perceptron algorithm, which directly minimises the number of misclassifications;
- Fisher linear discriminant, which directly maximises a measure of class separability.

Simple Vowel Classifier

Select two vowels to classify with a linear decision boundary.

[Figure: scatter plot of the two vowel classes in the (Formant One, Formant Two) plane, with a linear classifier decision boundary.]

Most of the data is correctly classified, but classification errors occur:

- realisations of vowels vary from speaker to speaker;
- the same vowel varies for the same speaker as well (vowel context etc.).

Single Layer Perceptron

[Figure: single layer perceptron with inputs x_1, ..., x_d, weights w_1, ..., w_d, bias w_0, a summation node producing z, and an activation function giving y(x).]

A single layer perceptron typically uses a threshold activation function. The d-dimensional input vector x and the scalar value z are related by

    z = w^T x + w_0

z is then fed to the activation function to yield y(x). The parameters of this system are:

- the weights, w = [w_1, ..., w_d]^T, which select the direction of the decision boundary;
- the bias, w_0, which sets the position of the decision boundary.

The parameters are often combined into a single composite vector, w̃, and an extended input vector, x̃:

    w̃ = [w_1, ..., w_d, w_0]^T,    x̃ = [x_1, ..., x_d, 1]^T

Single Layer Perceptron (cont)

We can then write

    z = w̃^T x̃

The task is to train the set of model parameters w̃. For this example the decision boundary is placed at z = 0, and the decision rule is: assign the input to class ω_1 (y(x) = 1) if z ≥ 0, otherwise (z < 0) assign it to class ω_2.

[Figure: two classes of points in two dimensions with a linear decision boundary.]

If the training data is linearly separable in the d-dimensional space, then using an appropriate training algorithm perfect classification (on the training data at least!) can be achieved.

Is the solution unique? The precise solution depends on the training criterion/algorithm used.
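As a concrete illustration, here is a minimal sketch of this decision rule; the function names, and the example weights and input, are illustrative assumptions rather than values from the notes:

import numpy as np

def extend(x):
    # Append a constant 1 so the bias w0 is absorbed into the extended weight vector.
    return np.append(x, 1.0)

def perceptron_decide(w_ext, x):
    # Threshold activation on z = w'x + w0: class omega_1 if z >= 0, otherwise omega_2.
    z = w_ext @ extend(x)
    return 1 if z >= 0 else 2

# Illustrative weights w = [1, -1] and bias w0 = 0.5, i.e. w_ext = [1, -1, 0.5].
w_ext = np.array([1.0, -1.0, 0.5])
print(perceptron_decide(w_ext, np.array([0.2, 0.9])))   # prints 2, since z = -0.2 < 0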

Parameter Optimisation

First a cost (or loss) function of the weights, E(w̃), is defined. A learning process which minimises the cost function is then used. One basic procedure is gradient descent:

1. Start with some initial estimate w̃[0], set τ = 0.
2. Compute the gradient ∇E(w̃)|_{w̃[τ]}.
3. Update the weights by moving a small distance in the steepest downhill direction, to give the estimate of the weights at iteration τ+1:

       w̃[τ+1] = w̃[τ] − η ∇E(w̃)|_{w̃[τ]}

   Set τ = τ+1.
4. Repeat steps (2) and (3) until convergence, or until the optimisation criterion is satisfied.

If gradient descent is used then the cost function E(w̃) must be differentiable (and hence continuous). This means that the misclassification count cannot be used as the cost function for gradient descent schemes. Note that, in general, gradient descent is not guaranteed to decrease the cost function at every step (for example if the step size is too large).
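A minimal sketch of this loop, assuming a simple fixed learning rate and an illustrative quadratic cost (neither is from the notes):

import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, max_iter=1000, tol=1e-6):
    # w[tau+1] = w[tau] - eta * grad E(w)|w[tau], repeated until the update is tiny.
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - eta * grad_E(w)
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w

# Illustrative cost E(w) = ||w - [1, 2]||^2 with gradient 2 * (w - [1, 2]).
grad_E = lambda w: 2.0 * (w - np.array([1.0, 2.0]))
print(gradient_descent(grad_E, w0=[0.0, 0.0]))   # converges towards [1, 2]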

Choice of η

In the previous slide the learning rate η was used in the optimisation scheme; η is positive.

[Figure: cost function E plotted against the weights, showing slow descent when the step size is too small, divergence when the step size is too large, and the desired minimum.]

When setting η we need to consider:

- if η is too small, convergence is slow;
- if η is too large, we may overshoot the solution and diverge.

Later in the course we will examine improvements to gradient descent schemes. Some of these give automated techniques for setting η.

Local Minima

Another problem with gradient descent is local minima. At all maxima and minima ∇E(w̃) = 0, so:

- gradient descent stops at all maxima/minima;
- the estimated parameters will depend on the starting point.

[Figure: a criterion plotted against the parameter θ, showing a global minimum and local minima; for a convex function there is a single solution.]

Expectation-Maximisation, for example, is also only guaranteed to find a local maximum of the likelihood function.

Perceptron Criterion

The decision boundary should be constructed to minimise the number of misclassified training examples. For each training example x_i we require

    w̃^T x̃_i > 0  ⇒  x_i ∈ ω_1
    w̃^T x̃_i < 0  ⇒  x_i ∈ ω_2

The distance from the decision boundary is proportional to w̃^T x̃, but we cannot simply sum the values of w̃^T x̃ as a cost function, since the sign depends on the class. For training, we therefore replace all the observations of class ω_2 by their negative values. Thus

    x̃ →  x̃   if the sample belongs to ω_1
    x̃ → −x̃   if the sample belongs to ω_2

This means that for a correctly classified sample w̃^T x̃ > 0, and for a misclassified training example w̃^T x̃ < 0. We will refer to this as normalising the data.

Perceptron Solution Region

Each sample x̃_i places a constraint on the possible location of a solution vector that classifies all samples correctly. w̃^T x̃_i = 0 defines a hyperplane through the origin of the weight-space of w̃ vectors, with x̃_i as a normal vector. For normalised data, the solution vector must lie on the positive side of every such hyperplane. The solution vector, if it exists, is therefore not unique: it may lie anywhere within the solution region.

[Figure (from Duda, Hart & Stork): four training points and the solution region in weight-space, shown for both un-normalised data (with the separating plane) and normalised data.]

Perceptron Criterion (cont)

The perceptron criterion may be expressed as

    E(w̃) = Σ_{x̃ ∈ Y} (−w̃^T x̃)

where Y is the set of misclassified points. We now want to minimise the perceptron criterion, and can use gradient descent. It is simple to show that

    ∇E(w̃) = Σ_{x̃ ∈ Y} (−x̃)

Hence the gradient descent update rule is

    w̃[τ+1] = w̃[τ] + η Σ_{x̃ ∈ Y[τ]} x̃

where Y[τ] is the set of misclassified points using w̃[τ]. For the case when the samples are linearly separable, using a value of η = 1 is guaranteed to converge.

The basic algorithm is:

1. Take the extended observations x̃ for class ω_2 and invert the sign to give −x̃.
2. Initialise the weight vector w̃[0], set τ = 0.
3. Using w̃[τ], produce the set of misclassified samples Y[τ].
4. Use the update rule

       w̃[τ+1] = w̃[τ] + Σ_{x̃ ∈ Y[τ]} x̃

   then set τ = τ+1.
5. Repeat steps (3) and (4) until the convergence criterion is satisfied.
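A minimal sketch of this batch procedure with η = 1; the function name and the convergence handling are illustrative assumptions:

import numpy as np

def train_perceptron(X1, X2, w0, max_iter=100):
    # X1, X2: (n, d) arrays of class omega_1 / omega_2 observations.
    # Extend each observation with a constant 1, then negate the class-2 samples
    # ("normalising" the data), so a correct classification always has w'x > 0.
    Z1 = np.hstack([X1, np.ones((len(X1), 1))])
    Z2 = -np.hstack([X2, np.ones((len(X2), 1))])
    Z = np.vstack([Z1, Z2])

    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        Y = Z[Z @ w <= 0]          # misclassified samples (boundary cases counted as errors)
        if len(Y) == 0:            # all samples correctly classified: converged
            break
        w = w + Y.sum(axis=0)      # batch perceptron update with eta = 1
    return w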

A Simple Example

Class ω_1 has the points

    [1, 0]^T, [1, 1]^T, [0.6, 0.6]^T, [0.7, 0.4]^T

Class ω_2 has the points

    [0, 0]^T, [0, 1]^T, [0.25, 1]^T, [0.3, 0.4]^T

Initial estimate of the extended weight vector:

    w̃[0] = [0, 1, −0.5]^T

This yields the initial estimate of the decision boundary shown below. Given this initial estimate we need to train the decision boundary.

[Figure: the two classes of points and the initial decision boundary.]

Simple Example (cont)

First use the current decision boundary to obtain the set of misclassified points. For the data from class ω_1:

    point:    1      2      3      4
    z:       -0.5    0.5    0.1   -0.1
    class:    2      1      1      2

and for class ω_2:

    point:    1      2      3      4
    z:       -0.5    0.5    0.5   -0.1
    class:    2      1      1      2

The set of misclassified points, Y[0], therefore consists of x̃_1 and x̃_4 from class ω_1, together with x̃_2 and x̃_3 from class ω_2. From the perceptron update rule (adding the normalised, i.e. negated, class ω_2 samples) this yields the updated vector

    w̃[1] = w̃[0] + [1, 0, 1]^T + [0.7, 0.4, 1]^T + [0, −1, −1]^T + [−0.25, −1, −1]^T = [1.45, −0.6, −0.5]^T

Simple Example (cont)

Apply the decision rule again. For the class ω_1 data:

    point:    1       2       3       4
    z:        0.95    0.35    0.01    0.275
    class:    1       1       1       1

and for class ω_2:

    point:    1       2        3         4
    z:       -0.5    -1.1     -0.7375   -0.305
    class:    2       2        2         2

All points are correctly classified, so the algorithm has converged. This yields the decision boundary shown below.

[Figure: the two classes of points and the trained decision boundary.]

Is this a good decision boundary?
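Running the train_perceptron sketch from above on this data set (as reconstructed here, so the exact numbers should be treated with some caution) reproduces the single update and the converged weights:

import numpy as np

X1 = np.array([[1.0, 0.0], [1.0, 1.0], [0.6, 0.6], [0.7, 0.4]])   # class omega_1
X2 = np.array([[0.0, 0.0], [0.0, 1.0], [0.25, 1.0], [0.3, 0.4]])  # class omega_2

w = train_perceptron(X1, X2, w0=[0.0, 1.0, -0.5])
print(w)   # [ 1.45 -0.6  -0.5 ]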

Fisher's Discriminant Analysis

An alternative approach to estimating a linear classifier uses Fisher's discriminant analysis. The aim is to find the projection that maximises the distance between the class means, whilst minimising the within-class variance. Note that only the projection w is determined. The following cost function is used:

    E(w) = (µ̃_1 − µ̃_2)^2 / (s_1 + s_2)

where s_j and µ̃_j are the projected scatter and projected mean for class ω_j. The projected scatter is defined as

    s_j = Σ_{x_i ∈ ω_j} (w^T x_i − µ̃_j)^2

The cost function may also be expressed as

    E(w) = (w^T S_b w) / (w^T S_w w)

where

    S_b = (µ_1 − µ_2)(µ_1 − µ_2)^T
    S_w = S_1 + S_2,    S_j = Σ_{x_i ∈ ω_j} (x_i − µ_j)(x_i − µ_j)^T

and the mean of class ω_j, µ_j, is defined as usual.

Fisher's Discriminant Analysis (cont)

Differentiating E(w) with respect to the weights, the criterion is maximised when

    (ŵ^T S_b ŵ) S_w ŵ = (ŵ^T S_w ŵ) S_b ŵ

From the definition of S_b,

    S_b ŵ = (µ_1 − µ_2)(µ_1 − µ_2)^T ŵ = ((µ_1 − µ_2)^T ŵ)(µ_1 − µ_2)

We therefore know that

    S_w ŵ ∝ (µ_1 − µ_2)

Multiplying both sides by S_w^{-1} yields

    ŵ ∝ S_w^{-1} (µ_1 − µ_2)

This gives the direction of the decision boundary. However we still need the bias value b. If the data is separable using Fisher's discriminant, it makes sense to select the value of b that maximises the margin. Simply put this means that, given no additional information, the boundary should be equidistant from the two points either side of the decision boundary.

Example

Using the previously described data:

    µ_1 = [0.825, 0.500]^T,    µ_2 = [0.1375, 0.600]^T

    S_w = [ 0.2044  0.0300
            0.0300  1.2400 ]

Solving yields the direction

    ŵ = [3.3878, −0.1626]^T

Taking the midpoint between the (projected) observations closest to the decision boundary gives the extended weight vector

    w̃ = [3.3878, −0.1626, −1.4432]^T

[Figure: the two classes of points and the resulting Fisher discriminant decision boundary.]
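A minimal sketch of this calculation, assuming the same (reconstructed) data points as in the perceptron example and that class ω_1 lies on the positive side of the projection:

import numpy as np

X1 = np.array([[1.0, 0.0], [1.0, 1.0], [0.6, 0.6], [0.7, 0.4]])   # class omega_1
X2 = np.array([[0.0, 0.0], [0.0, 1.0], [0.25, 1.0], [0.3, 0.4]])  # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)              # within-class scatter matrices
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2

w = np.linalg.solve(Sw, mu1 - mu2)          # direction: w proportional to Sw^-1 (mu1 - mu2)

# Bias that places the boundary midway between the closest projected points of each class.
b = -0.5 * ((X1 @ w).min() + (X2 @ w).max())
print(w, b)                                  # roughly [3.3878, -0.1626] and -1.4432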

Logistic Regression/Classification

It is interesting to contrast the perceptron algorithm with logistic regression/classification. This is also a linear classifier, but a discriminative model rather than a discriminant function. The training criterion is

    L(w̃) = Σ_{i=1}^{n} [ y_i log( 1 / (1 + exp(−w̃^T x̃_i)) ) + (1 − y_i) log( 1 / (1 + exp(w̃^T x̃_i)) ) ]

where (note the change of label definition)

    y_i = 1 if x_i was generated by class ω_1
    y_i = 0 if x_i was generated by class ω_2

It is simple to show that (this has a nice intuitive feel)

    ∇L(w̃) = Σ_{i=1}^{n} ( y_i − 1 / (1 + exp(−w̃^T x̃_i)) ) x̃_i

Maximising the log-likelihood is equivalent to minimising the negative log-likelihood, E(w̃) = −L(w̃). The update rule then becomes

    w̃[τ+1] = w̃[τ] + η ∇L(w̃)|_{w̃[τ]}

Compared to the perceptron algorithm:

- an appropriate method to select η is needed;
- it does not necessarily correctly classify the training data, even if it is linearly separable;
- it is not necessary for the data to be linearly separable;
- it yields a class posterior (use Bayes' decision rule to get the hypothesis).
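A minimal batch gradient-ascent sketch of this training scheme; the fixed η, iteration count and toy data are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, n_iter=2000):
    # Gradient ascent on L(w): w[tau+1] = w[tau] + eta * sum_i (y_i - sigma(w'x_i)) x_i.
    X_ext = np.hstack([X, np.ones((len(X), 1))])    # extended observations
    w = np.zeros(X_ext.shape[1])
    for _ in range(n_iter):
        w = w + eta * X_ext.T @ (y - sigmoid(X_ext @ w))
    return w

# Toy data: y = 1 for class omega_1, y = 0 for class omega_2 (label convention as above).
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = train_logistic(X, y)
print(sigmoid(np.hstack([X, np.ones((4, 1))]) @ w))  # class posteriors P(omega_1 | x)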

Kesler's Construct

So far only binary classifiers have been examined; direct use of multiple binary classifiers can result in no-decision regions (see examples paper). The multi-class problem can instead be converted to a 2-class problem. Consider an extended observation x̃ which belongs to class ω_1. For it to be correctly classified,

    w̃_1^T x̃ − w̃_j^T x̃ > 0,    j = 2, ..., K

There are therefore K−1 inequalities, requiring that the K(d+1)-dimensional vector

    α = [w̃_1^T, ..., w̃_K^T]^T

correctly classifies all K−1 of the K(d+1)-dimensional samples

    γ_2 = [x̃^T, −x̃^T, 0^T, ..., 0^T]^T
    γ_3 = [x̃^T, 0^T, −x̃^T, ..., 0^T]^T
    ...
    γ_K = [x̃^T, 0^T, 0^T, ..., −x̃^T]^T

i.e. α^T γ_j > 0 for j = 2, ..., K. The multi-class problem has now been transformed into a two-class problem, at the expense of increasing the effective dimensionality of the data and increasing the number of training samples. We now simply optimise for α.
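A minimal sketch of constructing these vectors for a single training sample; the function name and example values are illustrative, and the notes only treat the case where the sample belongs to ω_1, so allowing a general class index here is an assumption:

import numpy as np

def kesler_vectors(x_ext, true_class, K):
    # Build the K-1 samples gamma_j for one extended observation x_ext (length d+1)
    # belonging to 'true_class' (1-based). A correct alpha satisfies alpha' gamma_j > 0.
    d1 = len(x_ext)
    gammas = []
    for j in range(1, K + 1):
        if j == true_class:
            continue
        gamma = np.zeros(K * d1)
        gamma[(true_class - 1) * d1: true_class * d1] = x_ext   # +x in the true-class block
        gamma[(j - 1) * d1: j * d1] = -x_ext                    # -x in the competing block
        gammas.append(gamma)
    return gammas

# Example: one 2-D observation (extended with 1) from class 1 in a 3-class problem.
for g in kesler_vectors(np.array([0.5, 1.0, 1.0]), true_class=1, K=3):
    print(g)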

Limitations of Linear Decision Boundaries

Perceptrons were very popular until, in 1969, it was realised that they could not solve the XOR problem. Perceptrons can, however, implement the binary logic operators AND, OR, NAND and NOR.

[Figure: single layer perceptron implementations of the four operators, e.g. AND with weights (1, 1) and bias −1.5, OR with weights (1, 1) and bias −0.5, NAND with weights (−1, −1) and bias 1.5, NOR with weights (−1, −1) and bias 0.5.]

XOR (cont)

XOR may, however, be written in terms of AND, NAND and OR gates: XOR(x_1, x_2) = AND(OR(x_1, x_2), NAND(x_1, x_2)).

[Figure: a two-layer network in which an OR unit and a NAND unit feed an AND unit, together with the resulting decision boundaries.]

So XOR can be solved using a two-layer network. The problem is how to train multi-layer perceptrons. In the 1980s an algorithm for training such networks was proposed: error back-propagation.
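A minimal sketch of this two-layer construction with hard-threshold units; the gate weights and biases are the standard textbook choices and are assumptions, not values read from the original figure:

import numpy as np

def threshold_unit(w, b):
    # Single layer perceptron with a hard threshold: output 1 if w'x + b >= 0, else 0.
    return lambda x: 1 if np.dot(w, x) + b >= 0 else 0

OR   = threshold_unit([ 1.0,  1.0], -0.5)
NAND = threshold_unit([-1.0, -1.0],  1.5)
AND  = threshold_unit([ 1.0,  1.0], -1.5)

def XOR(x):
    # Two-layer network: hidden OR and NAND units feed a single AND output unit.
    return AND([OR(x), NAND(x)])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, XOR(x))   # prints 0, 1, 1, 0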