Institute for Advanced Management Systems Research Department of Information Technologies Åbo Akademi University


The winner-take-all learning rule - Tutorial
Robert Fullér, rfuller@abo.fi
(c) 2010, November 4, 2010

Table of Contents
1. Kohonen's network
2. Semi-linear activation functions
3. Derivatives of sigmoidal activation functions

1. Kohonen's network

Unsupervised classification learning is based on clustering of input data. No a priori knowledge is assumed to be available regarding an input's membership in a particular class. Rather, gradually detected characteristics and a history of training will be used to assist the network in defining classes and possible boundaries between them. Clustering is understood to be the grouping of similar objects and the separating of dissimilar ones. We discuss Kohonen's network [Samuel Kaski, Teuvo Kohonen: Winner-take-all networks for physiological models of competitive learning, Neural Networks 7(1994), 973-984.], which classifies input vectors into one of the specified number of m

categories, according to the clusters detected in the training set {x^1, ..., x^K}.

Figure 1: A winner-take-all learning network (inputs x_1, ..., x_n connected to m output units through weights w_11, ..., w_mn).

The learning algorithm treats the set of m weight vectors as

variable vectors that need to be learned. Prior to the learning, the normalization of all (randomly chosen) weight vectors is required. The weight adjustment criterion for this mode of training is the selection of w_r such that

||x - w_r|| = min_{i=1,...,m} ||x - w_i||.

The index r denotes the winning neuron number corresponding to the vector w_r, which is the closest approximation of the current input x. Using the equality

||x - w_i||^2 = < x - w_i, x - w_i > = < x, x > - 2 < w_i, x > + < w_i, w_i > = ||x||^2 - 2 < w_i, x > + ||w_i||^2 = ||x||^2 - 2 < w_i, x > + 1,

we can infer that searching for the minimum of the m distances corresponds to finding the maximum among the m scalar products,

< w_r, x > = max_{i=1,...,m} < w_i, x >.
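This equivalence between the minimum-distance and maximum-scalar-product criteria is easy to check numerically; the following is a minimal sketch (the random data and variable names are mine, not from the tutorial):

```python
import numpy as np

# Numeric check: for unit-length weight vectors, the neuron closest to x
# in Euclidean distance is the one with the largest scalar product < w_i, x >.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                    # m = 4 weight vectors in R^3
W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalize: ||w_i|| = 1
x = rng.normal(size=3)                         # current input pattern

winner_by_distance = int(np.argmin(np.linalg.norm(x - W, axis=1)))
winner_by_product = int(np.argmax(W @ x))
assert winner_by_distance == winner_by_product
```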

Figure 2: The winner weight is w_2.

Taking into consideration that ||w_i|| = 1 for all i in {1, ..., m}, the scalar product < w_i, x > is nothing else but the projection

Section 1: Kohonen s network 8 of x on the direction of w i. It is clear that the closer the vector w i to x the bigger the projection of x on w i. Note that < w r, x > is the activation value of the winning neuron which has the largest value net i, i = 1,..., m. When using the scalar product metric of similarity, the synaptic weight vectors should be modified accordingly so that they become more similar to the current input vector. With the similarity criterion being cos(w i, x), the weight vector lengths should be identical for this training approach. However, their directions should be modified. Intuitively, it is clear that a very long weight vector could lead

to a very large output of its neuron even if there were a large angle between the weight vector and the pattern. This explains the need for weight normalization. After the winning neuron has been identified and declared a winner, its weight must be adjusted so that the distance ||x - w_r|| is reduced in the current training step. Thus, ||x - w_r|| must be reduced, preferably along the gradient direction in the weight space w_r1, ..., w_rn.

The gradient of the squared distance with respect to w is

d/dw ||x - w||^2 = d/dw < x - w, x - w > = d/dw ( < x, x > - 2 < w, x > + < w, w > )

= -2 d< w, x >/dw + d< w, w >/dw    (since d< x, x >/dw = 0)

= -2 d(w_1 x_1 + ... + w_n x_n)/dw + d(w_1^2 + ... + w_n^2)/dw

= -2 (x_1, ..., x_n)^T + 2 (w_1, ..., w_n)^T

= -2 (x - w).

It seems reasonable to reward the weights of the winning neuron with an increment of weight in the negative gradient direction, thus in the direction x - w_r. We thus have

w_r := w_r + η(x - w_r)

where η is a small learning constant selected heuristically, usually between 0.1 and 0.7. The remaining weight vectors are left unaffected. Kohonen's learning algorithm can be summarized in the following three steps.

Step 1: w_r := w_r + η(x - w_r), o_r := 1 (r is the winner neuron)

Step 2 (normalization): w_r := w_r / ||w_r||

Step 3: w_i := w_i, o_i := 0 for i ≠ r (losers are unaffected)

It should be noted that from the identity

w_r := w_r + η(x - w_r) = (1 - η) w_r + η x

it follows that the updated weight vector is a convex linear combination of the old weight and the pattern vectors.
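The three steps can be sketched as a small training loop, for instance in NumPy; this is a minimal illustration under my own naming and default choices (function name, η, epoch count), not the tutorial's code:

```python
import numpy as np

def kohonen_wta(X, m, eta=0.4, epochs=10, seed=0):
    """Sketch of the three-step winner-take-all rule.
    X: (K, n) training patterns; m: number of neurons; eta in [0.1, 0.7]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalize before learning
    for _ in range(epochs):
        for x in X:
            r = np.argmax(W @ x)          # winner: largest scalar product
            W[r] += eta * (x - W[r])      # Step 1: move winner toward x
            W[r] /= np.linalg.norm(W[r])  # Step 2: renormalize the winner
            # Step 3: losers w_i, i != r, are left unchanged
    return W
```

Since the weights are kept at unit length, picking the winner by the largest scalar product is equivalent to picking the smallest distance ||x - w_i||.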

Figure 3: Updating the weight of the winner neuron, w_2 := (1 - η) w_2 + η x.

At the end of the training process, the final weight vectors point to the centers of gravity of the classes.

Figure 4: The final weight vectors point to the centers of gravity of the classes.

The network will only be trainable if classes/clusters of patterns are linearly separable from other classes by hyperplanes passing through the origin.

To ensure the separability of clusters when the number of clusters in the training data is not known a priori, the unsupervised training can be performed with an excessive number of neurons, which provides a certain separability safety margin. During the training, some neurons are likely not to develop their weights; if their weights change chaotically, they will not be considered as indicative of clusters. Such weights can therefore be omitted during the recall phase, since their output does not provide any essential clustering information. The weights of the remaining neurons should settle at values that are indicative of clusters.

2. Semi-linear activation functions

In many practical cases, instead of linear activation functions we use semi-linear ones. The following are the most often used types of activation functions.

Linear: f(< w, x >) = w^T x

Piecewise linear: f(< w, x >) = 1 if < w, x > > 1; < w, x > if |< w, x >| <= 1; -1 if < w, x > < -1

Hard: f(< w, x >) = sign(w^T x)

Unipolar sigmoidal: f(< w, x >) = 1 / (1 + exp(-w^T x))

Bipolar sigmoidal (1): f(< w, x >) = tanh(w^T x)

Bipolar sigmoidal (2): f(< w, x >) = 2 / (1 + exp(-w^T x)) - 1

3. Derivatives of sigmoidal activation functions

The derivatives of sigmoidal activation functions are extensively used in learning algorithms.
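As an illustration of how such derivatives enter a learning algorithm (this example is mine, not from the tutorial): a single unipolar-sigmoid neuron trained by gradient descent on a squared error uses the identity f'(t) = f(t)(1 - f(t)) in its weight update.

```python
import numpy as np

# Illustrative only: one unipolar-sigmoid neuron, gradient descent on
# squared error. The data, target, and learning rate are made up.
def f(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.0, 2.0])   # input pattern
d = 1.0                          # desired output
w = np.zeros(3)                  # initial weights
eta = 0.5                        # learning rate

for _ in range(200):
    o = f(w @ x)                     # neuron output
    delta = (d - o) * o * (1.0 - o)  # error times f'(w^T x), via the identity
    w += eta * delta * x             # gradient-descent weight update
```

After the loop, the output f(w^T x) approaches the target d.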

If f is a bipolar sigmoidal activation function of the form

f(t) = 2 / (1 + exp(-t)) - 1,

then the following equality holds:

f'(t) = 2 exp(-t) / (1 + exp(-t))^2 = 1/2 (1 - f^2(t)).

If f is a unipolar sigmoidal activation function of the form

f(t) = 1 / (1 + exp(-t)),

then f satisfies

f'(t) = f(t)(1 - f(t)).

Figure 5: Bipolar activation function.

Exercise 1. Let f be a bipolar sigmoidal activation function of the form

f(t) = 2 / (1 + exp(-t)) - 1.

Show that f satisfies the following differential equality:

f'(t) = 1/2 (1 - f^2(t)).

Figure 6: Unipolar activation function.

Solution 1. By using the chain rule for derivatives of composed functions we get

f'(t) = 2 exp(-t) / [1 + exp(-t)]^2.

From the identity

1/2 [1 - ((1 - exp(-t)) / (1 + exp(-t)))^2] = 2 exp(-t) / [1 + exp(-t)]^2

we get

2 exp(-t) / [1 + exp(-t)]^2 = 1/2 (1 - f^2(t)),

which completes the proof.

Exercise 2. Let f be a unipolar sigmoidal activation function of the form

f(t) = 1 / (1 + exp(-t)).

Show that f satisfies the following differential equality:

f'(t) = f(t)(1 - f(t)).

Solution 2. By using the chain rule for derivatives of composed functions we get

f'(t) = exp(-t) / [1 + exp(-t)]^2,

and the identity

f(t)(1 - f(t)) = 1 / (1 + exp(-t)) · exp(-t) / (1 + exp(-t)) = exp(-t) / [1 + exp(-t)]^2

verifies the statement of the exercise.
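Both derivative identities can be double-checked numerically with a central finite difference; a minimal sketch (the step size and tolerance are my choices):

```python
import numpy as np

# Finite-difference check of the two identities proved in the exercises.
def unipolar(t):
    return 1.0 / (1.0 + np.exp(-t))

def bipolar(t):
    return 2.0 / (1.0 + np.exp(-t)) - 1.0

t = np.linspace(-4, 4, 81)
h = 1e-6
num_uni = (unipolar(t + h) - unipolar(t - h)) / (2 * h)  # numerical f'(t)
num_bi = (bipolar(t + h) - bipolar(t - h)) / (2 * h)

assert np.allclose(num_uni, unipolar(t) * (1 - unipolar(t)), atol=1e-6)
assert np.allclose(num_bi, 0.5 * (1 - bipolar(t) ** 2), atol=1e-6)
```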