Learning Vector Quantization

Learning Vector Quantization
Neural Computation: Lecture 18
John A. Bullinaria, 2015

1. SOM Architecture and Algorithm
2. Vector Quantization
3. The Encoder-Decoder Model
4. Generalized Lloyd Algorithms
5. Relation between SOMs and Noisy Encoder-Decoders
6. Voronoi Tessellation
7. Learning Vector Quantization (LVQ)

Architecture of a Self Organizing Map

We continue to study the properties of the SOM known as a Kohonen Network. This has a feed-forward structure with a single computational layer of neurons arranged in rows and columns. Each neuron is fully connected to all the source units in the input layer.

[Figure: a low (1 or 2) dimensional discrete computational layer of winner-takes-all neurons j, connected by weights w_ij to a high dimensional continuous input layer of real-valued vector components i, giving a dimensional reduction.]

A one dimensional map will just have a single row or column in the computational layer.

The SOM Algorithm

The aim is to learn a feature map from the spatially continuous input space, in which the input patterns exist as vectors, to a low dimensional spatially discrete output space, which is formed by arranging the computational neurons into a grid. The stages of the SOM algorithm that achieves this can be summarised as follows (a minimal code sketch follows below):

1. Initialization: Choose random values for the initial weight vectors w_j.
2. Sampling: Draw a sample training input vector x from the input space.
3. Matching: Find the winning neuron I(x) that has the weight vector closest to the input vector, i.e. the minimum value of $d_j(\mathbf{x}) = \sum_{i=1}^{D} (x_i - w_{ji})^2$.
4. Updating: Apply the weight update equation $\Delta w_{ji} = \eta(t)\, T_{j,I(\mathbf{x})}(t)\, (x_i - w_{ji})$, where $T_{j,I(\mathbf{x})}(t)$ is a Gaussian neighbourhood function and $\eta(t)$ is the learning rate.
5. Continuation: Keep returning to step 2 until the feature map stops changing.

This iterative process converges to a feature map with numerous useful properties.
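The following Python sketch shows one minimal way to implement steps 1 to 5 for a one dimensional map with a Gaussian neighbourhood and an exponentially decaying learning rate. The function name, array shapes and decay schedule are my own illustrative choices, not taken from the lecture.

    import numpy as np

    def train_som(X, n_nodes=20, n_steps=10000, eta0=0.5, sigma0=5.0, tau=2000.0):
        """Minimal 1-D SOM sketch: X is an (N, D) array of real-valued input vectors."""
        rng = np.random.default_rng(0)
        D = X.shape[1]
        W = rng.uniform(X.min(), X.max(), size=(n_nodes, D))     # 1. Initialization
        grid = np.arange(n_nodes)                                # node positions on the 1-D map
        for t in range(n_steps):
            x = X[rng.integers(len(X))]                          # 2. Sampling
            winner = np.argmin(((x - W) ** 2).sum(axis=1))       # 3. Matching: I(x)
            eta = eta0 * np.exp(-t / tau)                        # decaying learning rate eta(t)
            sigma = sigma0 * np.exp(-t / tau)                    # shrinking neighbourhood width
            T = np.exp(-(grid - winner) ** 2 / (2 * sigma ** 2)) # Gaussian neighbourhood T_{j,I(x)}(t)
            W += eta * T[:, None] * (x - W)                      # 4. Updating
        return W                                                 # 5. Continuation approximated by a fixed n_steps

In practice one would monitor how much the map changes per epoch and stop when it stabilises, rather than fixing n_steps in advance.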

Properties of the Feature Map

Once the SOM algorithm has converged, the feature map displays important statistical characteristics of the input space. Given an input vector x, the feature map Φ provides a single winning representative neuron I(x) in the output space, and its weight vector w_I(x) provides the coordinates of the image of that neuron in the input space.

[Figure: the feature map Φ takes each vector x in the high dimensional continuous input space to a winning neuron I(x) in the low dimensional discrete output space, whose weight vector w_I(x) lies back in the input space.]

Properties: Approximation, Topological ordering, Density matching, Feature selection.

Vector Quantization

It has already been noted that one aim of using a Self Organizing Map (SOM) is to encode a large set of input vectors {x} by finding a smaller set of representatives or prototypes or code-book vectors {w_I(x)} that provide a good approximation to the original input space. This is the basic idea of vector quantization theory, the motivation of which is to facilitate dimensionality reduction or data compression.

In effect, the error of the vector quantization approximation is the total squared distance

    D = \sum_{\mathbf{x}} \| \mathbf{x} - \mathbf{w}_{I(\mathbf{x})} \|^2

between the input vectors {x} and their representatives {w_I(x)}, and we clearly wish to minimize this. It can be shown that performing a gradient descent style minimization of D does lead to the SOM weight update algorithm, which confirms that it is generating the best possible discrete low dimensional approximation to the input space (at least assuming it does not get trapped in a local minimum of the error function).
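As a concrete check, the total distortion D for a trained codebook can be computed directly. The helper below is illustrative only; it reuses the hypothetical W returned by the train_som sketch above as the codebook.

    import numpy as np

    def vq_distortion(X, W):
        """Total squared distance between each input and its nearest code-book vector."""
        # dists[n, j] = ||x_n - w_j||^2 for every input / code-book pair
        dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        winners = dists.argmin(axis=1)                       # I(x) for each input
        return dists[np.arange(len(X)), winners].sum()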

The Encoder-Decoder Model

Probably the best way to think about vector quantization is in terms of general encoders and decoders. Suppose c(x) acts as an encoder of the input vector x, and x′(c) acts as a decoder of c(x); then we can attempt to get back to x with minimal loss of information:

    Input Vector x  ->  Encoder c(x)  ->  Code c(x)  ->  Decoder x′(c)  ->  Reconstructed Vector x′(c(x))

Generally, the input vector x will be selected at random according to some probability density function p(x). Then the optimum encoding-decoding scheme is determined by varying the functions c(x) and x′(c) to minimize the expected distortion defined by

    D = \left\langle \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x})) \|^2 \right\rangle = \int d\mathbf{x}\, p(\mathbf{x})\, \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x})) \|^2

The Generalized Lloyd Algorithm

The necessary conditions for minimizing the expected distortion D in general situations are embodied in the two conditions of the Generalized Lloyd Algorithm:

Condition 1. Given the input vector x and decoder x′(c), choose the code c = c(x) to minimize the squared error distortion $\| \mathbf{x} - \mathbf{x}'(c) \|^2$.

Condition 2. Given the code c, compute the reconstruction vector x′(c) as the centroid of those input vectors x that satisfy Condition 1.

To implement vector quantization, the algorithm works in batch mode by alternately optimizing the encoder c(x) in accordance with Condition 1, and then optimizing the decoder x′(c) in accordance with Condition 2, until D reaches a minimum. To overcome the problem of local minima, it may be necessary to run the algorithm several times with different initial code vectors. Note the clear correspondence with the two alternating phases of the K-Means Clustering Algorithm (a small sketch follows below).
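A minimal batch implementation of the two alternating conditions might look like the following. The Python names are my own; in this noiseless case the procedure is exactly K-means clustering.

    import numpy as np

    def generalized_lloyd(X, n_codes=8, n_iters=50, seed=0):
        """Alternate Condition 1 (nearest-code assignment) and Condition 2 (centroid update)."""
        rng = np.random.default_rng(seed)
        codebook = X[rng.choice(len(X), n_codes, replace=False)].astype(float)
        for _ in range(n_iters):
            # Condition 1: encode each x by its nearest reconstruction vector
            dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            codes = dists.argmin(axis=1)
            # Condition 2: each reconstruction vector becomes the centroid of its cell
            for c in range(n_codes):
                members = X[codes == c]
                if len(members) > 0:
                    codebook[c] = members.mean(axis=0)
        return codebook, codes

Running it several times with different seeds, as the slide suggests, reduces the risk of stopping in a poor local minimum of D.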

The Noisy Encoder-Decoder Model

In real applications, the encoder-decoder system will also have to cope with noise in the communication channel. Such noise can be conveniently treated as an additive random variable ν with probability density function π(ν), so the model becomes:

    Input Vector x  ->  Encoder c(x)  ->  [Σ: add Noise ν]  ->  Decoder x′(c)  ->  Reconstructed Vector x′(c(x) + ν)

It is then not difficult to see that the total expected distortion is now given by

    D_\nu = \left\langle \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu) \|^2 \right\rangle_{\mathbf{x},\nu} = \int d\mathbf{x}\, p(\mathbf{x}) \int d\nu\, \pi(\nu)\, \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu) \|^2

Conditions for Minimal Expected Distortion D_ν

The contribution to the total distortion D_ν corresponding to a single data point x is

    D_\nu(\mathbf{x}) = \left\langle \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu) \|^2 \right\rangle_{\nu} = \int d\nu\, \pi(\nu)\, \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu) \|^2

The partial derivative of the total distortion D_ν with respect to the decoder x′(c) is

    \frac{\partial D_\nu}{\partial \mathbf{x}'(c)} = -2\, \left\langle \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c)) \right\rangle_{\mathbf{x}} = -2 \int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c))

and at the minimum of the distortion this partial derivative will be zero, i.e.

    \int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c)) = 0

which can be solved for x′(c) to give the decoder:

    \mathbf{x}'(c) = \frac{\int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, \mathbf{x}}{\int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))}

Minimizing the Expected Distortion D_ν

Generally, given the decoder x′(c), the optimal encoder c(x) for each data point x can be found straightforwardly using a simple nearest neighbour approach. The easiest way to arrive at the optimal decoder x′(c) is usually to follow a standard gradient descent approach. The relevant partial derivative for a single data point x is

    \frac{\partial D_\nu(\mathbf{x})}{\partial \mathbf{x}'(c)} = -2\, \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c))

so appropriate gradient descent updates of the decoder to minimize the distortion are

    \Delta \mathbf{x}'(c) = -\eta\, \frac{\partial D_\nu(\mathbf{x})}{\partial \mathbf{x}'(c)} = \eta\, \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c))

where the constant factor of 2 has been absorbed into the learning rate η. Since the optimal decoder depends on the encoder, and the optimal encoder depends on the decoder, an alternating series of updates will be required to end up with the minimum expected distortion D_ν, as in the K-Means Clustering Algorithm.

The Generalized Lloyd Algorithm with Noise

Thus, to minimize the expected distortion D_ν, the encoder and decoder are iteratively updated to satisfy the conditions of the modified Generalized Lloyd Algorithm:

Condition 1. Given the input vector x and decoder x′(c), choose the code c = c(x) to minimize the distortion measure $\int d\nu\, \pi(\nu)\, \| \mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu) \|^2$.

Condition 2. Given the code c, compute the reconstruction vector x′(c) to satisfy

    \mathbf{x}'(c) = \frac{\int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, \mathbf{x}}{\int d\mathbf{x}\, p(\mathbf{x})\, \pi(c - c(\mathbf{x}))}

At each iteration, Condition 1 is satisfied using a simple nearest neighbour approach for c(x), and then gradient descent updates $\Delta \mathbf{x}'(c) = \eta\, \pi(c - c(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}'(c))$ are applied to the reconstruction vector x′(c) until Condition 2 is satisfied (a code sketch of this noisy update is given below). Note that if the noise density function π(ν) is set to be the Dirac delta function δ(ν), which is zero everywhere except at ν = 0, the conditions here reduce to those in the no noise case considered previously.
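The following sketch, in the same illustrative Python style as the earlier examples and with names of my own choosing, performs one online pass of these noisy updates using a Gaussian π over a one dimensional code index. Comparing it with the train_som sketch above makes the correspondence described on the next slide concrete.

    import numpy as np

    def noisy_lloyd_step(X, codebook, eta=0.1, sigma=2.0, seed=0):
        """One online pass: nearest-neighbour encoding, then pi-weighted decoder updates."""
        rng = np.random.default_rng(seed)
        codebook = codebook.astype(float).copy()
        codes = np.arange(len(codebook))                         # 1-D code indices c
        for x in X[rng.permutation(len(X))]:
            # Condition 1 (nearest neighbour approach): encode x by its closest reconstruction vector
            c_x = ((x - codebook) ** 2).sum(axis=1).argmin()
            # Gradient step towards Condition 2, with pi(c - c(x)) taken as a Gaussian in the code index
            pi = np.exp(-(codes - c_x) ** 2 / (2 * sigma ** 2))
            codebook += eta * pi[:, None] * (x - codebook)
        return codebook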

Relation between a SOM and the Noisy Encoder-Decoder

It is now easy to see that a direct correspondence can be established between the SOM algorithm and the noisy encoder-decoder model:

    Noisy Encoder-Decoder Model                  SOM Algorithm
    Encoder c(x)                                 Best matching neuron I(x)
    Reconstruction vector x′(c)                  Connection weight vector w_j
    Probability density function π(c - c(x))     Neighbourhood function T_{j,I(x)}

This relation provides a proof that the SOM algorithm is a vector quantization algorithm that generates an optimal approximation of the input space. Note that the topological neighbourhood function T_{j,I(x)} in the SOM algorithm maps to a noise probability density function, so a Gaussian form for it can be justified.

Voronoi Tessellation

A vector quantizer with minimum encoding distortion is called a Voronoi quantizer or nearest-neighbour quantizer. The input space is partitioned into a set of Voronoi or nearest-neighbour cells, each containing an associated Voronoi or reconstruction vector.

The SOM algorithm provides a useful method for computing the Voronoi vectors (as weight vectors) in an unsupervised manner. One common application is to use it for finding a good set of basis function centres and widths in RBF networks (a sketch of that use follows below).
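As an illustration of that last point, the snippet below assigns data points to the Voronoi cells of a set of trained weight vectors and derives one plausible per-centre RBF width from the spread within each cell. The function name and the width heuristic are my own assumptions, not prescribed by the lecture.

    import numpy as np

    def voronoi_cells_and_widths(X, W):
        """Assign each x to its nearest weight vector and estimate a width per Voronoi cell."""
        dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        cell = dists.argmin(axis=1)                      # index of the Voronoi cell for each x
        widths = np.empty(len(W))
        for j in range(len(W)):
            members = X[cell == j]
            # one simple heuristic: RMS distance of the cell's members to its centre
            widths[j] = np.sqrt(dists[cell == j, j].mean()) if len(members) else np.nan
        return cell, widths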

Learning Vector Quantization (LVQ)

Learning Vector Quantization (LVQ) is a supervised version of vector quantization that can be used when labelled input data is available. This learning technique uses the class information to reposition the Voronoi vectors slightly, so as to improve the quality of the classifier decision regions. It is a two stage process, a SOM followed by LVQ:

    Inputs  ->  SOM  ->  LVQ  <-  Teacher (class labels)

This is particularly useful for pattern classification problems. The first step is feature selection: the unsupervised identification of a reasonably small set of features in which the essential information content of the input data is concentrated. The second step is the classification, where the feature domains are assigned to individual classes.

The LVQ Approach

The basic LVQ approach is quite intuitive. It is based on a standard trained SOM with input vectors {x} and winning representative weights / Voronoi vectors {w_j}. The additional factor is that the input data points have associated class information.

This means the known classification labels of the inputs can be used to find the best classification label for each w_j, i.e. for each Voronoi cell, by simply determining for each cell which class has the largest number of input instances within it. It then follows that each new input without a class label can be classified simply by assigning it to the class of the Voronoi cell it falls within.

The problem with this is that, in general, it is unlikely that the Voronoi cell boundaries will match up with the best possible classification boundaries, so the classification generalization performance will not be as good as possible. The obvious solution is to shift the Voronoi cell boundaries so they better match the classification boundaries.

The LVQ Algorithm

The basic LVQ algorithm is a straightforward method for shifting the Voronoi cell boundaries to result in better classification. It starts from the trained SOM with input vectors {x} and weights / Voronoi vectors {w_j}, and uses the classification labels of the inputs to find the best classification label for each w_j. The LVQ algorithm then checks the input classes against the Voronoi cell classes and moves the w_j appropriately (a code sketch follows below):

1. If the input x and the associated Voronoi vector / weight w_I(x) (i.e. the weight of the winning output node I(x)) have the same class label, then move them closer together by $\Delta \mathbf{w}_{I(\mathbf{x})}(t) = \beta(t)\, (\mathbf{x} - \mathbf{w}_{I(\mathbf{x})}(t))$, as in the SOM algorithm.

2. If the input x and the associated Voronoi vector / weight w_I(x) have different class labels, then move them apart by $\Delta \mathbf{w}_{I(\mathbf{x})}(t) = -\beta(t)\, (\mathbf{x} - \mathbf{w}_{I(\mathbf{x})}(t))$.

3. The Voronoi vectors / weights w_j corresponding to other input regions are left unchanged, with $\Delta \mathbf{w}_j(t) = 0$.

Here β(t) is a learning rate that decreases with the number of iterations/epochs of training. In this way, one achieves better classification than from the SOM alone.
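A minimal sketch of this update rule, in illustrative Python with my own function and variable names, continuing the earlier examples. It assumes each Voronoi vector has already been given a class label cell_labels[j] by majority vote, as described on the previous slide.

    import numpy as np

    def lvq1_epoch(X, y, W, cell_labels, beta=0.05):
        """One epoch of basic LVQ: pull the winning vector towards same-class inputs, push it away otherwise."""
        for x, label in zip(X, y):
            winner = ((x - W) ** 2).sum(axis=1).argmin()   # winning node I(x)
            if cell_labels[winner] == label:
                W[winner] += beta * (x - W[winner])        # same class: move closer together
            else:
                W[winner] -= beta * (x - W[winner])        # different class: move apart
            # all other w_j are left unchanged
        return W

In a full run beta would be decreased over epochs, for example beta = beta0 * (1 - epoch / n_epochs).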

The LVQ2 Algorithm

A second, improved, LVQ algorithm known as LVQ2 is sometimes preferred because it comes closer in effect to Bayesian decision theory. The same weight/vector update equations are used as in the standard LVQ, but they only get applied under certain conditions, namely when:

1. The input vector x is incorrectly classified by the associated Voronoi vector w_I(x),
2. The next nearest Voronoi vector w_S(x) does give the correct classification, and
3. The input vector x is sufficiently close to the decision boundary (perpendicular bisector plane) between w_I(x) and w_S(x).

In this case, both vectors w_I(x) and w_S(x) are updated, using the incorrect and correct classification update equations respectively (a sketch of the condition test follows below). Numerous other variations on this theme also exist (LVQ3, etc.), and this is still a fruitful research area for building better classification systems.
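The lecture does not spell out how "sufficiently close to the decision boundary" is tested. One common choice, used in Kohonen's LVQ2.1, is a relative-distance window, and the hypothetical sketch below adopts that interpretation; the names and the window parameter are my own.

    import numpy as np

    def lvq2_step(x, label, W, cell_labels, beta=0.05, window=0.3):
        """Apply the LVQ2-style update only when the three conditions on the slide hold."""
        d = np.sqrt(((x - W) ** 2).sum(axis=1))
        i, s = np.argsort(d)[:2]                           # nearest I(x) and next nearest S(x)
        wrong_winner = cell_labels[i] != label             # condition 1
        right_runner_up = cell_labels[s] == label          # condition 2
        near_boundary = min(d[i] / d[s], d[s] / d[i]) > (1 - window) / (1 + window)  # condition 3 (window rule)
        if wrong_winner and right_runner_up and near_boundary:
            W[i] -= beta * (x - W[i])                      # incorrect-class update: move away
            W[s] += beta * (x - W[s])                      # correct-class update: move closer
        return W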

Overview and Reading

1. We began with an overview of SOMs and vector quantization.
2. Then we looked at general encoder-decoder models and noisy encoder-decoder models, and the Generalized Lloyd Algorithms for optimizing them. This led to a clear relation between SOMs and vector quantization.
3. We ended by studying Learning Vector Quantization (LVQ) from the point of view of Voronoi tessellation, and saw how the LVQ algorithm could optimize the class decision boundaries generated by a SOM.

Reading

1. Haykin-1999: Sections 9.5, 9.7, 9.8, 9.10, 9.11
2. Beale & Jackson: Section 5.6
3. Gurney: Section 8.3.6
4. Hertz, Krogh & Palmer: Section 9.2
5. Ham & Kostanic: Sections 4.1, 4.2, 4.3