Artificial Neural Networks (ANN)
Edmondo Trentin
April 17, 2013

ANN: Definition
The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:
1. architecture
2. dynamics
3. learning
3.1 generalization
In the following we give a definition of each one of these quantities.

ANN definition #1: architecture
An ANN is a (directed or undirected) graph $A = (V, E)$ whose vertices $v \in V$ are called units or neurons, and whose edges $E$ are called (synaptic) connections. $V$ has three subsets: $I \subseteq V$, the input units; $O \subseteq V$, the output units; $H \subseteq V$, the (hidden) units, s.t. $H \cap I = H \cap O = \emptyset$. Vertices and edges of the graph are labeled: the edge labels, known as the weights, are scalars $w \in \mathbb{R}$; the vertex labels are real-valued functions, known as the activation functions.

Typical activation functions $f(\cdot): \mathbb{R} \to \mathbb{R}$
- Step functions, a.k.a. Threshold Logic Units (TLU).
- Linear functions $f(a) = a$ or $f(a) = wa + r$.

- (Logistic) sigmoids $f(a) = 1/(1 + e^{-a})$, with $0 < f(a) < 1$.
- Hyperbolic tangent sigmoid $f(a) = \tanh(a)$, with $-1 < f(a) < 1$.
- Gaussian $f(a) = N(a; 0, 1)$.
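
For concreteness, a minimal Python/NumPy sketch of these scalar activations (the function names and the step threshold are our own illustrative choices, not from the slides):

```python
import numpy as np

def step(a, threshold=0.0):
    """Threshold Logic Unit (TLU): 1 if a >= threshold, else 0."""
    return np.where(a >= threshold, 1.0, 0.0)

def linear(a, w=1.0, r=0.0):
    """Linear activation f(a) = w*a + r (identity when w=1, r=0)."""
    return w * a + r

def logistic(a):
    """Logistic sigmoid, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh_sigmoid(a):
    """Hyperbolic tangent sigmoid, bounded in (-1, 1)."""
    return np.tanh(a)

def gaussian(a):
    """Standard Gaussian activation N(a; 0, 1)."""
    return np.exp(-0.5 * a**2) / np.sqrt(2.0 * np.pi)

a = np.linspace(-3, 3, 7)
print(logistic(a))   # values strictly between 0 and 1
```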

Typical activation functions $f(\cdot): \mathbb{R}^d \to \mathbb{R}$
- Scalar fields, such as multivariate Gaussian kernels $f(\mathbf{a}) = N(\mathbf{a}; \mu, \Sigma)$.

The architecture, or topology, of an ANN is thus specified by $(V, E, I, O, H)$ and by the labels $\{w\}$ and $\{f(\cdot)\}$.

ANN definition #2: dynamics
The dynamics of an ANN specifies how the signal propagates through the network, that is to say, how the ANN comes alive.
1. The ANN is fed with an input $\mathbf{x} = (x_1, \ldots, x_d)$. To this end, the ANN must have $d$ input units, $|I| = d$, where $x_i$ is the input to the $i$-th input unit. The input units do not transform the signal; they forward it to other neurons along their outgoing connections.
2. In traversing a connection having weight $w$, the signal $o$ undergoes a transformation, namely it is multiplied by $w$.

3. Let $j$ be a generic unit in $H \cup O$. It receives as many signals $w_1 o_1, \ldots, w_n o_n$ as the number of its incoming connections. There are two cases:
3.1 The activation function associated with unit $j$ is scalar, i.e. $f: \mathbb{R} \to \mathbb{R}$. In this case the input (i.e., activation) $a_j$ to $j$ is the weighted sum of all incoming contributions:
$$a_j = \sum_{i=1}^{n} w_i o_i \qquad (1)$$
and the output $o_j$ of the unit is obtained as $o_j = f(a_j)$ (which, in turn, is propagated forward along the outgoing connections to other neurons).
3.2 The activation function associated with unit $j$ is $f: \mathbb{R}^n \to \mathbb{R}$. In this case, $f(\cdot)$ is applied to $\mathbf{a} = (w_1 o_1, \ldots, w_n o_n)$ such that $o_j = f(\mathbf{a})$, e.g. a Gaussian evaluated on $\mathbf{a}$.
4. The signal propagates this way through the whole ANN, up to the output units. Let $k$ be an output unit (i.e., $k \in O$). Then the value $o_k$ is the $k$-th output of the ANN. If the latter has $m$ output units, the ANN output is the overall $m$-dimensional vector $\mathbf{y} = (y_1, \ldots, y_m)$ where $y_k = o_k$.
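
The dynamics above (case 3.1, with a scalar activation shared by all units) can be sketched as follows; layer sizes, the weight range, and the sigmoid choice are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, f=sigmoid):
    """Propagate input x through a layered feed-forward ANN.

    weights: list of (n_out, n_in) matrices, one per layer of connections.
    f: scalar activation applied element-wise to each unit's activation.
    """
    o = np.asarray(x, dtype=float)     # input units just forward the signal
    for W in weights:
        a = W @ o                      # a_j = sum_i w_i o_i, Eq. (1)
        o = f(a)                       # o_j = f(a_j)
    return o                           # outputs y_k = o_k for k in O

rng = np.random.default_rng(0)
weights = [rng.uniform(-0.1, 0.1, (3, 2)),   # 2 inputs -> 3 hidden units
           rng.uniform(-0.1, 0.1, (1, 3))]   # 3 hidden -> 1 output unit
print(forward([0.5, -0.2], weights))
```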

Note: the time trigger (clock) of ANNs depends on the specific family of networks at hand. Remark: the ANN may also be seen as a particular computation paradigm (like an analog computer, or a Turing machine) whose units are simple analog processors. The ANN topology specifies the hardware architecture, the values of the connection weights $w$ are the software, while the ANN dynamics (the living machine) represents the running process.

ANN definition #3: learning (and generalization)
Once architecture and dynamics have been defined, the definition of an ANN is completed by introducing the main feature of ANNs: the learning algorithm. ANNs learn from the examples contained in a training set $\tau = \{z\}$. There are three instances of learning:
- supervised learning: $\tau = \{(\mathbf{x}, \mathbf{y})\}$
- unsupervised learning: $\tau = \{\mathbf{x}\}$
- reinforcement learning: $\tau = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, (\mathbf{x}_t, y), \mathbf{x}_{t+1}, \ldots\}$, where a reinforcement signal $y$ (either a penalty or a reward) is given every now and then.
Learning is thought of as a process of progressive modification of the connection weights, aimed at inferring the (universal) laws underlying the data. Far from being a bare memorization of the data, the laws learned this way are expected to generalize to new, previously unseen data (generalization capability).

Multilayer Perceptron (MLP)
Fully connected layered architecture with feed-forward propagation of the input signals. Activation functions $f: \mathbb{R} \to \mathbb{R}$, usually sigmoid and/or linear. All units in a given layer share the same form of activation function.
Dynamics: simultaneous propagation of the signal (with no delays) from the input layer to the output layer, traversing zero, one, or more hidden layers.
Learning: supervised, via the gradient method (backpropagation).

Special case: Simple Perceptron (SP)
Convention: in MLPs the input layer is not counted (it acts as a mere placeholder); hence, the SP is a 1-layer ANN.
Notation: $w_{ij}$ denotes the weight of the connection between the $j$-th input unit and the $i$-th output unit.
Computed function: let $f: \mathbb{R} \to \mathbb{R}$ be the activation function associated with the output unit(s). Then:
$$y_i = f(a_i) = f\left(\sum_{j=1}^{d} w_{ij} x_j\right) \qquad (2)$$
In particular, if $f$ is linear then $y_i = \sum_j w_{ij} x_j = \mathbf{w}_i^T \mathbf{x}$ (Simple Linear Perceptron).

SP Learning: delta rule
Training set: $\tau = \{(\mathbf{x}, \hat{\mathbf{y}}): \mathbf{x} \in \mathbb{R}^d, \hat{\mathbf{y}} \in \mathbb{R}^m\}$
Batch criterion function: $C(\tau, W)$, where $W = \{w_{ij} \mid i = 1, \ldots, m;\ j = 1, \ldots, d\}$. Once $\tau$ is given, we have
$$C(\tau, W) = C(W) = \frac{1}{2} \sum_{\tau} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
In practice we resort to on-line learning, relying on the following criterion function:
$$C(W) = \frac{1}{2} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 \qquad (3)$$
to be minimized, in an iterative fashion, over each training example. Gradient descent prescribes this delta rule:
$$\Delta w_{ij} = -\eta \frac{\partial C}{\partial w_{ij}} \qquad (4)$$
which updates a generic weight $w_{ij}$ to its new value $w'_{ij} = w_{ij} + \Delta w_{ij}$.

Vectorial notation: $\mathbf{w}' = \mathbf{w} - \eta \nabla_{\mathbf{w}} C(\mathbf{w})$.
Calculation of $\frac{\partial C}{\partial w_{ij}}$ goes as follows. Let $f_i(a_i)$ be the activation function associated with the $i$-th output unit. Then:
$$\frac{\partial C}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left\{ \frac{1}{2} \sum_{k=1}^{m} (\hat{y}_k - y_k)^2 \right\} = \frac{1}{2} \sum_{k=1}^{m} \frac{\partial}{\partial w_{ij}} (\hat{y}_k - y_k)^2 = \frac{1}{2} \frac{\partial}{\partial w_{ij}} (\hat{y}_i - y_i)^2 = -(\hat{y}_i - y_i) \frac{\partial y_i}{\partial w_{ij}} \qquad (5)$$
Since $\Delta w_{ij} = -\eta \frac{\partial C}{\partial w_{ij}}$, from Eq. (5) we have:
$$\Delta w_{ij} = \eta (\hat{y}_i - y_i) \frac{\partial y_i}{\partial w_{ij}} \qquad (6)$$
Now, let us calculate $\frac{\partial y_i}{\partial w_{ij}}$.

$$\frac{\partial y_i}{\partial w_{ij}} = \frac{\partial f_i(a_i)}{\partial w_{ij}} = f'_i(a_i) \frac{\partial a_i}{\partial w_{ij}} = f'_i(a_i) \frac{\partial}{\partial w_{ij}} \sum_{k=1}^{d} w_{ik} x_k = f'_i(a_i)\, x_j$$
Since $\Delta w_{ij} = \eta (\hat{y}_i - y_i) \frac{\partial y_i}{\partial w_{ij}}$, the delta rule takes the form:
$$\Delta w_{ij} = \eta (\hat{y}_i - y_i) f'_i(a_i)\, x_j = \eta\, \delta_i\, x_j$$
where we defined $\delta_i = (\hat{y}_i - y_i) f'_i(a_i)$. If $f(\cdot)$ is linear then $f'_i(a_i) = 1$, hence $\Delta w_{ij} = \eta (\hat{y}_i - y_i) x_j$.
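
A minimal sketch of on-line delta-rule learning for a Simple Linear Perceptron, where $f' = 1$ so that $\Delta w_{ij} = \eta(\hat{y}_i - y_i)x_j$; the data, sizes, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, eta = 3, 2, 0.05
W = rng.uniform(-0.01, 0.01, (m, d))       # weights w_ij

X = rng.normal(size=(100, d))              # training inputs x
M = rng.normal(size=(d, m))
Y_hat = X @ M                              # synthetic linear targets yhat

for epoch in range(50):
    for x, y_hat in zip(X, Y_hat):
        y = W @ x                          # SP output: y_i = sum_j w_ij x_j
        delta = y_hat - y                  # delta_i = yhat_i - y_i
        W += eta * np.outer(delta, x)      # w_ij += eta * delta_i * x_j

print(np.mean((X @ W.T - Y_hat) ** 2))     # near-zero residual error
```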

MLP: Backpropagation
Training set: $\tau = \{(\mathbf{x}_t, \hat{\mathbf{y}}_t): t = 1, \ldots, n\}$
Batch criterion function: $C(\tau, W) = C(W) = \frac{1}{2} \sum_{t=1}^{n} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$
On-line criterion function: $C(W) = \frac{1}{2} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$
For a generic weight $w$ we apply gradient descent: $w' = w + \Delta w$ with $\Delta w = -\eta \frac{\partial C}{\partial w}$. Let $f_i(a_i)$ be the activation function associated with the generic $i$-th unit in the MLP. Assume the MLP has $l$ layers, say $L_0$ (the input layer), $L_1, \ldots, L_{l-1}$ (the hidden layers), and $L_l$ (the output layer). We write $k \in L_j$ in order to refer to the $k$-th unit in the $j$-th layer. We aim at calculating $\Delta w = -\eta \frac{\partial C}{\partial w}$ for a generic weight $w$ in the MLP.

Case 1: $w$ is in the output layer
Let $w = w_{ij}$, where $j \in L_{l-1}$ and $i \in L_l$. We have:
$$\frac{\partial C}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left\{ \frac{1}{2} \sum_{k=1}^{m} (\hat{y}_k - y_k)^2 \right\} = \frac{1}{2} \sum_{k=1}^{m} \frac{\partial}{\partial w_{ij}} (\hat{y}_k - y_k)^2 = \frac{1}{2} \frac{\partial}{\partial w_{ij}} (\hat{y}_i - y_i)^2 = -(\hat{y}_i - y_i) \frac{\partial y_i}{\partial w_{ij}}$$
$$\frac{\partial y_i}{\partial w_{ij}} = f'_i(a_i) \frac{\partial a_i}{\partial w_{ij}} = f'_i(a_i) \frac{\partial}{\partial w_{ij}} \sum_{k} w_{ik} o_k = f'_i(a_i)\, o_j$$

In summary, so far we have:
$$\frac{\partial C}{\partial w_{ij}} = -(\hat{y}_i - y_i) \frac{\partial y_i}{\partial w_{ij}}, \qquad \frac{\partial y_i}{\partial w_{ij}} = f'_i(a_i)\, o_j$$
from which:
$$\Delta w_{ij} = \eta (\hat{y}_i - y_i) f'_i(a_i)\, o_j$$
Thus, if $i \in L_l$ and $j \in L_{l-1}$, by defining $\delta_i = (\hat{y}_i - y_i) f'_i(a_i)$ we obtain the following delta rule:
$$\Delta w_{ij} = \eta\, \delta_i\, o_j \qquad (7)$$
where $o_j = f_j(a_j)$. Of course, it is the same learning rule we found for SPs, the only difference being that $o_j$ stands in place of $x_j$.

Case 2: $w$ is in a hidden layer
To fix ideas, let $w = w_{jk}$ where $j \in L_{l-1}$ and $k \in L_{l-2}$ (still, the results we are presenting hold true in all the hidden layers $L_{l-3}, \ldots, L_0$). Again, we let $\Delta w_{jk} = -\eta \frac{\partial C}{\partial w_{jk}}$. We can write:
$$\frac{\partial C}{\partial w_{jk}} = \frac{\partial}{\partial o_j} \left\{ \frac{1}{2} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 \right\} \frac{\partial o_j}{\partial w_{jk}} = \left\{ \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial o_j} (\hat{y}_i - y_i)^2 \right\} \frac{\partial o_j}{\partial w_{jk}} = \left\{ -\sum_{i=1}^{m} (\hat{y}_i - y_i) \frac{\partial y_i}{\partial o_j} \right\} \frac{\partial o_j}{\partial w_{jk}}$$
Let us calculate $\frac{\partial y_i}{\partial o_j}$.

$$\frac{\partial y_i}{\partial o_j} = f'_i(a_i) \frac{\partial a_i}{\partial o_j} = f'_i(a_i) \frac{\partial}{\partial o_j} \sum_{j} w_{ij} o_j = f'_i(a_i)\, w_{ij}$$
from which, since $\frac{\partial C}{\partial w_{jk}} = \left\{ -\sum_{i=1}^{m} (\hat{y}_i - y_i) \frac{\partial y_i}{\partial o_j} \right\} \frac{\partial o_j}{\partial w_{jk}}$, we obtain:
$$\frac{\partial C}{\partial w_{jk}} = \left\{ -\sum_{i=1}^{m} (\hat{y}_i - y_i) f'_i(a_i)\, w_{ij} \right\} \frac{\partial o_j}{\partial w_{jk}} = -\left( \sum_{i=1}^{m} w_{ij} \delta_i \right) \frac{\partial f_j(a_j)}{\partial a_j} \frac{\partial a_j}{\partial w_{jk}} = -\left( \sum_{i=1}^{m} w_{ij} \delta_i \right) f'_j(a_j)\, o_k$$

Thus far, $\Delta w_{jk} = -\eta \frac{\partial C}{\partial w_{jk}}$ and $\frac{\partial C}{\partial w_{jk}} = -\left( \sum_{i=1}^{m} w_{ij} \delta_i \right) f'_j(a_j)\, o_k$. Hence, we can write:
$$\Delta w_{jk} = \eta \left( \sum_{i=1}^{m} w_{ij} \delta_i \right) f'_j(a_j)\, o_k \qquad (8)$$
By defining $\delta_j = \left( \sum_{i=1}^{m} w_{ij} \delta_i \right) f'_j(a_j)$, the following delta rule is obtained:
$$\Delta w_{jk} = \eta\, \delta_j\, o_k \qquad (9)$$
which is in the same form as in case 1. More generally, given any weight $w_{jk}$ in any layer, the generalized delta rule $\Delta w_{jk} = \eta\, \delta_j\, o_k$ can be applied, provided that we define:
$$\delta_j = \begin{cases} (\hat{y}_j - y_j) f'_j(a_j) & \text{if } j \in L_l \\ \left( \sum_{i \in L_{k+1}} w_{ij} \delta_i \right) f'_j(a_j) & \text{if } j \in L_k, \ k = l-1, \ldots, 0 \end{cases}$$
Learning is iterated for a number of consecutive epochs on the training set (an epoch is a cycle of application of the delta rule over all the data in $\tau$).
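
The whole procedure can be condensed into a short sketch for an MLP with one sigmoid hidden layer and a linear output layer (architecture, data, and hyperparameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, h, m, eta = 2, 5, 1, 0.1
W1 = rng.uniform(-0.5, 0.5, (h, d))        # input -> hidden weights
W2 = rng.uniform(-0.5, 0.5, (m, h))        # hidden -> output weights

X = rng.uniform(-1, 1, (200, d))
Y_hat = X[:, :1] * X[:, 1:2]               # target: a simple nonlinear map

for epoch in range(500):
    for x, y_hat in zip(X, Y_hat):
        a1 = W1 @ x
        o1 = sigmoid(a1)                   # hidden outputs o_j
        y = W2 @ o1                        # linear output layer
        delta_out = y_hat - y              # delta_i, with f' = 1 (linear)
        delta_hid = (W2.T @ delta_out) * o1 * (1.0 - o1)  # sigmoid' = o(1-o)
        W2 += eta * np.outer(delta_out, o1)   # Delta w = eta * delta_i * o_j
        W1 += eta * np.outer(delta_hid, x)    # Delta w = eta * delta_j * o_k

Y = np.array([W2 @ sigmoid(W1 @ x) for x in X])
print(np.mean((Y - Y_hat) ** 2))           # error decreases during training
```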

Stopping Criteria
1. upon a fixed number of epochs
2. once the error $C$ falls below a given threshold: $C < \epsilon$
3. once the gain $\Delta C$ falls below a given threshold: $\Delta C < \epsilon$
4. evaluating (via cross-validation) the generalization capability of the MLP. In particular:
   Learning curve: error measured on the training set.
   Generalization curve: error measured on the validation set.
A similar phenomenon (overfitting) turns up as a function of the increasing complexity of the machine (Vapnik-Chervonenkis dimension).

Practical Issues and Some Insights
1. The weights need random initialization, uniformly over a small zero-centered range ($w_{ij} \in [-\varepsilon, \varepsilon]$), in order to avoid the saturation of the sigmoids.
2. For the same reason, as well as for numerical stability issues, the input values $(x_1, x_2, \ldots, x_d)$ must be normalized, for instance to $x_i \in [0, 1]$ or $x_i \in [-1, 1]$. A typical choice that guarantees values in the $[-1, 1]$ range with null mean: for each component $i = 1, \ldots, d$, subtract its average value and divide by the maximum absolute value of the $i$-th component (see the sketch after this list).
3. If sigmoids are used in the output layer, then the target outputs shall be normalized to $[0, 1]$ as well.
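
A sketch of the normalization recipe in point 2 (the function name is ours; we divide the centered components by their maximum absolute value so as to guarantee null mean and range $[-1, 1]$):

```python
import numpy as np

def normalize(X):
    """X: (n_samples, d). Returns zero-mean columns scaled into [-1, 1]."""
    Xc = X - X.mean(axis=0)                 # subtract per-component average
    return Xc / np.abs(Xc).max(axis=0)      # divide by max absolute value

X = np.array([[10.0, 0.1], [20.0, 0.3], [40.0, 0.2]])
Xn = normalize(X)
print(Xn.mean(axis=0), np.abs(Xn).max(axis=0))   # ~0 means, max |x| = 1
```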

4. Regularization: learning is modified in order to improve the generalization capabilities of the MLP. An idea is to force the complexity of the machine to be as small as possible (i.e., to limit the VC-dimension):
4.1 weight sharing: some connections are forced to share the same weight value $w$, and $w$ is computed by averaging over the different values yielded by the delta rule over the corresponding individual instances of $w$.
4.2 weight decay: numerically smaller weights entail simpler solutions:
$$C = \frac{1}{2} \sum_{i} (\hat{y}_i - y_i)^2 + \frac{\alpha}{2} \sum_{i,j} (w_{ij})^2 \qquad (10)$$
where $\frac{\alpha}{2} \sum_{i,j} (w_{ij})^2$ is the regularization term. For a generic $w$ we have
$$\Delta w = -\eta \frac{\partial C}{\partial w} = -\eta \frac{\partial}{\partial w} \left\{ \frac{1}{2} \sum_{i} (\hat{y}_i - y_i)^2 \right\} - \eta \alpha w$$
i.e., in addition to the usual $\Delta w$ due to the delta rule, learning requires decreasing $w$ by a fraction $\eta\alpha$ of its own value.
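
An illustrative form of the weight-decay update of Eq. (10): on top of the ordinary delta-rule step, every weight is shrunk by a fraction $\eta\alpha$ of itself (values are invented):

```python
import numpy as np

eta, alpha = 0.1, 0.01

def decayed_update(W, grad_step):
    """grad_step: the ordinary delta-rule increment eta * delta_j * o_k."""
    return W + grad_step - eta * alpha * W   # extra -eta*alpha*w term

W = np.array([[0.8, -0.5], [0.3, 0.9]])
print(decayed_update(W, np.zeros_like(W)))   # pure decay shrinks W
```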

5. We can use more flexible activation functions:
$$f(a) = \frac{\lambda}{1 + e^{-(a-b)/\theta}}$$
where $\lambda$ is the amplitude and $\theta$ is the smoothness (high impact on the derivative):

and $b$ is the bias (or offset). There are (gradient-descent) algorithms for learning $\lambda$, $\theta$, $b$. In particular, the bias may be learned just as a regular weight by means of an additional (bias) unit having constant output set to 1.
Figure: Bias

6. Learning may be made more stable by including a momentum term (or inertia) in the delta rule:
$$\Delta w(t+1) = -\eta \frac{\partial C}{\partial w}(t) + \rho\, \Delta w(t) \qquad (11)$$
where $\rho \in (0, 1)$ is the momentum rate (see the sketch after this list).
A supervised MLP learns a transformation $\varphi: \mathbb{R}^d \to \mathbb{R}^m$ which can be applied to:
- function approximation: $y = \varphi(\mathbf{x})$
- regression (linear or non-linear): $y = \varphi(\mathbf{x}) + \epsilon$, where $\epsilon$ is the (Gaussian) noise
- pattern classification: $y_i = g_i(\mathbf{x})$
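
A sketch of the momentum update of Eq. (11); all values are illustrative:

```python
import numpy as np

eta, rho = 0.1, 0.9
prev_update = np.zeros((2, 2))              # Delta_w(t)

def momentum_step(grad, prev_update):
    """Returns Delta_w(t+1) = -eta * dC/dw(t) + rho * Delta_w(t)."""
    return -eta * grad + rho * prev_update

grad = np.array([[0.2, -0.1], [0.0, 0.4]])  # stand-in for dC/dw
print(momentum_step(grad, prev_update))
```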

In 2-class classification, in particular, $y = g(\mathbf{x}) = \mathbf{u}^T \mathbf{x} + v$. In practice, the Simple Linear Perceptron is a linear discriminant (Widrow-Hoff algorithm: $\hat{y} = 1$ if the pattern belongs to the class at hand, $0$ otherwise). If the net has $c$ outputs (for the $c$ classes), the bias values will be different for each output, such that $g_i(\mathbf{x}) = \mathbf{u}_i^T \mathbf{x} + v_i$, for each $i = 1, \ldots, c$.

If we train an MLP on the same training set (e.g., labeled with $\hat{y}_i = 0/1$, or $\hat{y}_i = -1/1$) we obtain non-linear discriminant functions (if the output layer is linear, then we have a Generalized Linear Discriminant), having flexible (not necessarily convex or connected) decision regions.
Figure: Nonlinear separation surfaces

Universality of MLPs (Cybenko, Lippmann, 1989): an MLP with a hidden layer of sigmoid functions and a linear output layer is a universal machine. Formally: let $\epsilon \in \mathbb{R}^+$ be any arbitrarily small positive constant. For any $\varphi: \mathbb{R}^d \to \mathbb{R}^m$ s.t. $\varphi$ is continuous and bounded over a closed support $X \subseteq \mathbb{R}^d$, a two-layer (hidden-sigmoid, linear-output) MLP $\phi(\cdot)$ exists such that $\|\phi(\cdot) - \varphi(\cdot)\|_{X,\infty} < \epsilon$ (where $\|\cdot\|_{X,\infty}$ is the Chebyshev norm over $X$).
Remarks: universality says that there exists an MLP..., but it does not say:
- how many neurons such an MLP has;
- what the correct values of its weights are;
- if and how it is possible to learn the correct weights.
Nonetheless, universality proves the flexibility and the modeling capabilities of MLPs.
Note: http://en.wikipedia.org/wiki/cybenko_theorem

Mixtures of Experts
We can combine $k$ neural modules, or experts, $E_1, E_2, \ldots, E_k$. Each $E_i$ is an MLP computing the function $y_i = \varphi_i(\mathbf{x})$. The overall machine thus realizes:
$$y = \sum_{i=1}^{k} \alpha_i \varphi_i(\mathbf{x}) \qquad (12)$$
where $\alpha_i \in [0, 1]$ is the credit assigned by the gatherer (or gating module) to $E_i$ over input $\mathbf{x}$.

(i) The feature space may be partitioned (divide and conquer) into $k$ disjoint regions $A_1, A_2, \ldots, A_k$, and $E_i$ is specialized (i.e., trained and used) on $A_i$. The gatherer assigns credits this way: if $\mathbf{x} \in A_i$, then $\alpha_i = 1$ and $\alpha_j = 0$ for all $j \neq i$ (during both training and test).
(ii) Instead of fixed regions $A_i$, we may assume classes of credit $\omega_1, \ldots, \omega_k$ having pdfs $p(\mathbf{x} \mid \omega_i)$ (overlapping regions) which express the likelihood of $E_i$ being competent over any input pattern, e.g. $p(\mathbf{x} \mid \omega_i) = N(\mathbf{x}; \mu, \Sigma)$. The gatherer assigns credits $\alpha_i = P(\omega_i \mid \mathbf{x})$ under the condition $\sum_{j=1}^{k} \alpha_j = 1$, and $y = \sum_{i=1}^{k} P(\omega_i \mid \mathbf{x}) \varphi_i(\mathbf{x})$ (in both training and test). To do that, a model (estimator) of $P(\omega_i \mid \mathbf{x})$ is needed. To this end, Bayesian/statistical techniques are applied; in so doing, a relevant instance of a neural/statistical hybrid emerges.
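
A toy sketch of the soft gating of scheme (ii), with two stand-in experts and univariate Gaussian competence models (all functions and parameters are invented for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

experts = [lambda x: 2 * x, lambda x: x ** 2]        # stand-ins for phi_1, phi_2
mus, sigmas, priors = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]

def mixture_output(x):
    likes = np.array([p * gaussian_pdf(x, m, s)
                      for p, m, s in zip(priors, mus, sigmas)])
    alphas = likes / likes.sum()                     # posteriors, sum to 1
    return sum(a * E(x) for a, E in zip(alphas, experts))

print(mixture_output(0.3))
```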

(iii) Instead of fixing regions or classes a priori, the machine can learn the gating function during training, via a gating network (i.e., a referee) whose $i$-th output is the credit $\alpha_i(\mathbf{x})$. The gating network is trained in parallel with the experts, s.t. its output (credit) $\alpha_j$ weights the contribution from the $j$-th expert $E_j$ to the overall output; hence $\alpha_j$ affects the backpropagation of the derivatives of the criterion function through $E_j$.

Autoassociative neural net
It is an MLP which learns to map its inputs onto themselves. Training set (for BP): $\tau = \{(\mathbf{x}, \mathbf{x})\}$. Thus, it may be seen as an example of unsupervised learning.

Training
1. The ANN is trained to map $\mathbf{x}$ onto $\mathbf{x}$ via BP.
2. The hidden layer(s) turn(s) out to be a lower-dimensional representation of the feature space.
3. Let us remove the upper layer(s): in so doing, we obtain a feature extractor realizing a dimensionality reduction of the original feature space.
Remark: this is an instance of the aforementioned 3rd kind of non-parametric estimator.

Let us move one step further:
1. Let $\mathbb{R}^d$ be the feature space, and let $\tau = \{(\mathbf{x}, \hat{\mathbf{y}}) \mid \mathbf{x} \in \mathbb{R}^d, \hat{\mathbf{y}} \in \mathbb{R}^m\}$. Assume our goal is training an MLP (on $\tau$) to realize $\varphi: \mathbb{R}^d \to \mathbb{R}^m$.
2. From $\tau$ we define $\tau' = \{(\mathbf{x}, \mathbf{x}) \mid (\mathbf{x}, \hat{\mathbf{y}}) \in \tau \text{ for some } \hat{\mathbf{y}}\}$ and we train the autoassociative network on $\tau'$.
3. We remove the upper layer(s) from the autoassociative ANN, and we map $\mathbb{R}^d$ onto $\mathbb{R}^{d'}$. From $\tau$ we obtain $\tau'' = \{(\mathbf{z}, \hat{\mathbf{y}})\}$. In practice, we transform the original training patterns $\mathbf{x} \in \mathbb{R}^d$ into $\mathbf{z} \in \mathbb{R}^{d'}$ where $d' < d$, hence $\mathbb{R}^{d'}$ is a sub-space of $\mathbb{R}^d$ (see the sketch after this list).
4. Finally (as sought), an MLP is trained via backpropagation on $\tau''$, s.t. it realizes $\varphi': \mathbb{R}^{d'} \to \mathbb{R}^m$.
5. We mount the two MLPs on top of each other, keeping their weights clamped, obtaining the overall function $\varphi: \mathbb{R}^d \to \mathbb{R}^m$ sought.
6. Later, the weights of the overall deep MLP (point 5) may be further tuned via backpropagation on $\tau$, resulting in an improved model of $\varphi$.
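
Steps 2-3 may be sketched as follows (the autoassociator training itself is stubbed out; shapes, names, and the sigmoid encoder are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, d_red = 10, 3                       # original and reduced dimensionality

# autoassociator weights, as if already trained via BP on tau' = {(x, x)}
W_enc = rng.uniform(-0.1, 0.1, (d_red, d))   # kept: encoder layer
W_dec = rng.uniform(-0.1, 0.1, (d, d_red))   # decoder, removed in step 3

def extract_features(x):
    """Encoder only: z = f(W_enc x), the step-3 dimensionality reduction."""
    return sigmoid(W_enc @ x)

X = rng.normal(size=(5, d))
Z = np.array([extract_features(x) for x in X])   # inputs of tau''
print(Z.shape)                                    # (5, d') with d' < d
```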

MLP: decision making and probability estimation
There are two ways of using the MLP as a non-parametric estimator for pattern recognition:
1. Use MLPs as discriminant functions: as we said, train the MLP (like linear discriminants with Widrow-Hoff) via BP on a training set labeled with 0/1 target outputs, and assume the $i$-th output unit represents $g_i(\mathbf{x})$.
2. Probabilistic interpretation of the MLP output(s): we train one (or more) MLP(s) for the (non-parametric) estimation of the probabilistic quantities involved in Bayes' theorem: $P(\omega_i \mid \mathbf{x})$, or $p(\mathbf{x} \mid \omega_i)$.

The MLP output may be interpreted as a probability if and only if it is constrained within the $[0, 1]$ range. This is guaranteed if sigmoid activation functions are used. If the MLP has $c$ outputs and we want them to represent $P(\omega_1 \mid \mathbf{x}), P(\omega_2 \mid \mathbf{x}), \ldots, P(\omega_c \mid \mathbf{x})$, respectively, the following additional constraint must be satisfied: $\sum_{i=1}^{c} P(\omega_i \mid \mathbf{x}) = 1$. This is granted if we let:
$$P(\omega_i \mid \mathbf{x}) = \frac{y_i(\mathbf{x})}{\sum_{j=1}^{c} y_j(\mathbf{x})} \qquad (13)$$
where $y_j(\mathbf{x})$ is the $j$-th ANN output over the current input $\mathbf{x}$ (use this normalization also at training time).
Problem: what target outputs shall we use in order to train the MLP to model $P(\omega_i \mid \mathbf{x})$? The solution is yielded by the following theorem (Lippmann, Richard): by using the values 0 (if $\mathbf{x} \notin \omega_i$) or 1 (if $\mathbf{x} \in \omega_i$) as the target output for the $i$-th unit, minimization of the criterion $C(\mathbf{w})$ (quadratic error) equals minimizing the quadratic error between the $i$-th output and $P(\omega_i \mid \mathbf{x})$.

In other words, if we reach the global minimum $C_{MIN}(\bar{\mathbf{w}})$ using 0/1 targets and the right MLP architecture, we are guaranteed that the MLP obtained this way (having weights $\bar{\mathbf{w}}$) is the optimal Bayesian classifier. In practice, using backpropagation on real-world data we never reach the global minimum, hence the $i$-th output never coincides with $P(\omega_i \mid \mathbf{x})$. Moreover, in general $\sum_{i=1}^{c} y_i(\mathbf{x}) \neq 1$. This is why we need to impose the constraint $\sum_i P(\omega_i \mid \mathbf{x}) = 1$ explicitly. Another way of satisfying the constraint is to use a SOFTMAX normalization, by taking
$$P(\omega_i \mid \mathbf{x}) \simeq y_i(\mathbf{x}) = \frac{e^{a_i}}{\sum_{j=1}^{c} e^{a_j}} \qquad (14)$$
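
A sketch of the softmax of Eq. (14); the max subtraction is a standard numerical-stability trick, not part of the slides:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # subtract max for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])      # output activations a_i (illustrative)
p = softmax(a)
print(p, p.sum())                  # valid posterior estimates, sum = 1
```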

Q: may we use the MLP as an estimate of the class-conditional pdfs? Since
$$y_i(\mathbf{x}) \simeq P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{p(\mathbf{x})} \qquad (15)$$
we can write
$$\frac{y_i(\mathbf{x})}{P(\omega_i)} \simeq \frac{p(\mathbf{x} \mid \omega_i)}{p(\mathbf{x})} \qquad (16)$$
which is a scaled likelihood. If $p(\mathbf{x})$ is known, we can thus estimate the unknown pdf $p(\mathbf{x} \mid \omega_i)$ via MLP (provided that an estimate of $P(\omega_i)$ is used). Furthermore, should $P(\omega_i)$ change at test time (e.g., as a consequence of changed statistical conditions), assuming a new value $P'(\omega_i)$, we obtain the corresponding new estimate $P'(\omega_i \mid \mathbf{x})$ of the class posterior as:
$$P'(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)}{p(\mathbf{x})} P'(\omega_i) \simeq \frac{y_i(\mathbf{x})}{P(\omega_i)} P'(\omega_i) \qquad (17)$$
which requires no re-training of the MLP.
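
A toy computation of the prior correction of Eq. (17), with invented numbers:

```python
import numpy as np

y = np.array([0.7, 0.3])            # y_i(x) ~ P(omega_i | x) from the MLP
P_old = np.array([0.5, 0.5])        # priors at training time
P_new = np.array([0.9, 0.1])        # priors at test time

P_corrected = y / P_old * P_new     # ~ p(x|omega_i)/p(x) * P'(omega_i)
P_corrected /= P_corrected.sum()    # renormalize over the classes
print(P_corrected)                  # new posteriors, no retraining needed
```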

Radial Basis Function (RBF) Networks
Figure: RBF
$$y_i = \sum_{j=1}^{k} w_{ij} \varphi_j(\mathbf{x}) + w_{i0} \qquad (18)$$
Note the similarity with the Parzen Window. Note: it is a Generalized Linear Discriminant.

The RBF (or kernel) is defined as $\varphi_k(\mathbf{x}) = e^{-\frac{\|\mathbf{x} - \mu_k\|^2}{2\sigma_k^2}}$.
Simplified (yet degenerate) RBF: an MLP with one Gaussian hidden layer and a linear output layer.
Universality: just like MLPs, RBFs are universal approximators.
Learning (supervised): $C(\tau, \mathbf{w}) = \frac{1}{2} \sum_i (\hat{y}_i - y_i)^2$.
Approach 1: via gradient descent over $C(\mathbf{w})$, learning the parameters $w_{ij}$, $w_{i0}$, $\mu_k$, and $\sigma_k$.
Approach 2: $\mu_k$ and $\sigma_k$ are estimated statistically, then the hidden-to-output weights (i.e., the coefficients of the output linear transformation) are estimated via (i) linear algebra (say, matrix inversion) techniques, or (ii) gradient descent.
Remark: RBFs realize mixtures of Gaussian pdfs. Hence, they are particularly suitable for pdf estimation. Maximum-likelihood (ML) criterion: we apply gradient ascent over ML to the RBF in order to estimate a pdf. As long as the hidden-to-output weights sum to one, the estimate is a valid pdf. This does not work with MLPs, since the constraint $\int p(\mathbf{x})\, d\mathbf{x} = 1$ is violated (divergence problem).
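
A minimal sketch of the RBF forward pass of Eq. (18); centers, widths, and weights are invented here (normally they are learned as just described):

```python
import numpy as np

def rbf_forward(x, mus, sigmas, W, w0):
    """y_i = sum_j w_ij * phi_j(x) + w_i0, with Gaussian kernels phi_j."""
    phi = np.array([np.exp(-np.sum((x - mu) ** 2) / (2 * s ** 2))
                    for mu, s in zip(mus, sigmas)])
    return W @ phi + w0

rng = np.random.default_rng(0)
k, d, m = 4, 2, 1
mus = rng.normal(size=(k, d))            # kernel centers mu_k
sigmas = np.ones(k)                      # kernel widths sigma_k
W = rng.uniform(-1, 1, (m, k))           # hidden-to-output weights w_ij
w0 = np.zeros(m)                         # biases w_i0

print(rbf_forward(np.array([0.2, -0.5]), mus, sigmas, W, w0))
```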