Slide05 Haykin Chapter 5: Radial-Basis Function Networks


CPSC 636-600, Instructor: Yoonsuck Choe, Spring.

Learning in MLP
- Supervised learning in multilayer perceptrons: a recursive technique of stochastic approximation, e.g., backprop.
- Alternative view: design of the network as a curve-fitting (approximation) problem, e.g., RBF.
- Curve-fitting: finding a surface in a multidimensional space that provides a best fit to the training data, with "best fit" measured in a certain statistical sense.
- RBF is an example: the hidden neurons form an arbitrary basis for the input patterns when they are expanded into the hidden space. These basis functions are called radial basis functions.

Radial-Basis Function Networks
[Figure: network with an input layer, a hidden layer of radial-basis units φ_1, ..., φ_4, and a linear output unit with weights W.]
Three layers:
- Input.
- Hidden: nonlinear transformation from the input space to the hidden space.
- Output: linear activation.
Principal motivation: Cover's theorem, i.e., a pattern-classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Cover's Theorem
- Cover's theorem on the separability of patterns: a complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space.
- Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input space.
- Once the proper mapping is done, simple, quick algorithms can be used to find the separating hyperplane.
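
To make the three-layer structure above concrete, here is a minimal sketch (not from the lecture) of an RBF forward pass in NumPy: a Gaussian hidden layer followed by a linear output unit. The centers, width, and weights are illustrative placeholders.

```python
# Minimal sketch of the RBF forward pass described above: a Gaussian hidden
# layer followed by a linear output layer. Centers, width, and weights are
# illustrative placeholders, not values from the lecture.
import numpy as np

def rbf_forward(x, centers, sigma, w, b=0.0):
    """Map input x through Gaussian hidden units, then a linear output unit."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return w @ phi + b

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # hidden-unit centers (assumed)
w = np.array([0.5, -0.5])                      # hidden-to-output weights (assumed)
print(rbf_forward(np.array([0.2, 0.1]), centers, sigma=1.0, w=w))
```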

φ-Separability of Patterns
- N input patterns X = {x_1, x_2, ..., x_N} in an m_0-dimensional space.
- The inputs belong to one of two sets, X_1 and X_2: they form a dichotomy.
- The dichotomy is separable with respect to a family of surfaces if a surface exists in the family that separates the points in class X_1 from those in X_2.
- For each x ∈ X, define an m_1-vector {φ_i(x) | i = 1, 2, ..., m_1}:
  φ(x) = [φ_1(x), φ_2(x), ..., φ_{m_1}(x)]^T,
  which maps inputs in the m_0-dimensional space to the m_1-dimensional hidden space.
- The φ_i(x) are called hidden functions, and the space spanned by these functions is called the hidden space or feature space.
- A dichotomy is φ-separable if there exists an m_1-dimensional vector w such that
  w^T φ(x) > 0 for x ∈ X_1, and w^T φ(x) < 0 for x ∈ X_2,
  with separating hyperplane w^T φ(x) = 0.

Cover's Theorem Revisited
Given a set X of N inputs picked from the input space independently, suppose all possible dichotomies of X are equiprobable. Let P(N, m_1) denote the probability that a particular dichotomy picked at random is φ-separable, where the family of surfaces has m_1 degrees of freedom. In this case,
  P(N, m_1) = (1/2)^{N-1} Σ_{m=0}^{m_1 - 1} C(N-1, m),
where C(l, m) = l(l-1)(l-2)···(l-m+1) / m! is the binomial coefficient.

Cover's Theorem: Interpretation
- Separability depends on (1) the particular dichotomy and (2) the distribution of patterns in the input space.
- The derived P(N, m_1) states that the probability of being φ-separable equals the cumulative binomial distribution corresponding to the probability that N-1 flips of a fair coin will result in m_1 - 1 or fewer heads.
- In sum, Cover's theorem has two basic ingredients:
  1. Nonlinear mapping to the hidden space with φ_i(x), i = 1, ..., m_1.
  2. High dimensionality of the hidden space compared to the input space (m_1 > m_0).
- Corollary: a maximum of 2 m_1 patterns can be linearly separated by an m_1-dimensional hidden space.

Example: XOR (again!)
With Gaussian hidden functions, the inputs become linearly separable in the hidden space:
  φ_1(x) = exp(-||x - t_1||^2),  t_1 = [1, 1]^T
  φ_2(x) = exp(-||x - t_2||^2),  t_2 = [0, 0]^T
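
The XOR example can be checked numerically. The sketch below (an illustration, not part of the original slides) maps the four XOR inputs through the two Gaussian hidden functions above and prints their images in (φ_1, φ_2) space, where the two classes become linearly separable.

```python
# Map the four XOR inputs through the two Gaussian hidden functions with
# centers t_1 = [1, 1] and t_2 = [0, 0], as in the example above.
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])  # XOR targets

phi = np.column_stack([
    np.exp(-np.sum((X - t1) ** 2, axis=1)),   # phi_1(x)
    np.exp(-np.sum((X - t2) ** 2, axis=1)),   # phi_2(x)
])
for p, y in zip(phi, labels):
    print(p, y)
# In (phi_1, phi_2) space the class-0 points have phi_1 + phi_2 near 1.14 and
# the class-1 points near 0.74, so a straight line separates the two classes.
```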

2D Gaussian RBF
[Figure: surface and contour plot of the 2D Gaussian exp(-(x^2 + y^2)), centered at the origin, alongside the RBF network diagram.]
Basically, t_i determines the center of the 2D Gaussian
  φ_i(x) = exp(-||x - t_i||^2),
and points x that are equidistant from the center have the same φ value.

Learning: Overview
The RBF learning problem boils down to two tasks:
1. How to determine the parameters associated with the radial-basis functions in the hidden layer, φ_i(x) (e.g., the centers of the Gaussians).
2. How to train the hidden-to-output weights? This part is relatively easy.

RBF as an Interpolation Problem
- An m_0-dimensional input to 1-dimensional output mapping s: R^{m_0} → R.
- The map s can be thought of as a hypersurface Γ ⊂ R^{m_0 + 1}.
- Training: fit the hypersurface Γ to the training data points.
- Generalization: interpolate between data points, along the reconstructed surface Γ.

RBF and Interpolation
Given {x_i ∈ R^{m_0} | i = 1, 2, ..., N} and N labels {d_i ∈ R | i = 1, 2, ..., N}, find F: R^{m_0} → R such that F(x_i) = d_i for all i. Interpolation is formulated as
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||),
where {φ(||x - x_i||) | i = 1, 2, ..., N} is a set of N arbitrary (nonlinear) functions known as radial-basis functions. The known data points x_i ∈ R^{m_0}, i = 1, 2, ..., N, are treated as the centers of the RBFs. (Note that in this case all input data need to be memorized, as in instance-based learning, but this is not a necessary requirement.)
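
As a quick numerical check of the claim that points equidistant from the center t_i receive the same φ value, here is a tiny Gaussian-RBF sketch (illustrative only):

```python
# A Gaussian RBF centered at t assigns the same value to all points at the
# same distance from t.
import numpy as np

def gaussian_rbf(x, t):
    return np.exp(-np.sum((x - t) ** 2))

t = np.array([0.0, 0.0])
# Two different points, both at distance 1 from the center t.
print(gaussian_rbf(np.array([1.0, 0.0]), t))   # exp(-1)
print(gaussian_rbf(np.array([0.0, -1.0]), t))  # exp(-1), same value
```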

RBF and Interpolation (cont'd)
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||)
So we have N inputs, N hidden units, and one output unit. Expressing everything (all N input-output pairs) in matrix form:

  [ φ_{11}  φ_{12}  ...  φ_{1N} ] [ w_1 ]   [ d_1 ]
  [ φ_{21}  φ_{22}  ...  φ_{2N} ] [ w_2 ]   [ d_2 ]
  [  ...     ...    ...   ...   ] [ ... ] = [ ... ]
  [ φ_{N1}  φ_{N2}  ...  φ_{NN} ] [ w_N ]   [ d_N ]

where φ_{ji} = φ(||x_j - x_i||), with x_i the center. We can abbreviate the above as
  Φw = d.

RBF and Interpolation (cont'd)
From Φw = d we can find an explicit solution
  w = Φ^{-1} d,
assuming Φ is nonsingular. (Note: in general, the number of hidden units is much less than the number of inputs, so we don't always have Φ as a square matrix! We'll see how to handle this later.)
Nonsingularity of Φ is guaranteed by Micchelli's theorem: let {x_i}_{i=1}^{N} be a set of distinct points in R^{m_0}. Then the N-by-N interpolation matrix Φ, whose ji-th element is φ_{ji} = φ(||x_j - x_i||), is nonsingular.

RBF and Interpolation (cont'd)
When m_1 < N (m_1: number of hidden units; N: number of inputs), we can find the w that minimizes
  E(w) = Σ_{i=1}^{N} (F(x_i) - d_i)^2,  where F(x) = Σ_{k=1}^{m_1} w_k φ_k(x).
The solution involves the pseudoinverse of Φ:
  w = (Φ^T Φ)^{-1} Φ^T d.
Note: Φ is now an N × m_1 rectangular matrix. In this case, how to determine the centers of the φ_k(·) functions becomes an issue.

Typical RBFs
For some c > 0, σ > 0, and r ∈ R:
- Multiquadrics (non-local): φ(r) = (r^2 + c^2)^{1/2}
- Inverse multiquadrics (local): φ(r) = (r^2 + c^2)^{-1/2}
- Gaussian functions (local): φ(r) = exp(-r^2 / (2σ^2))
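
Both cases above, exact interpolation via Φw = d and the overdetermined case m_1 < N solved through the pseudoinverse, can be sketched in a few lines of NumPy. The Gaussian RBF, the synthetic 1-D data, and the choice of m_1 = 5 centers are assumptions made for illustration.

```python
# Exact interpolation (N centers, solve Phi w = d) versus the m_1 < N case
# (solve via the pseudoinverse), using a Gaussian RBF on made-up 1-D data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)               # N = 20 training inputs
d = np.sin(X) + 0.1 * rng.normal(size=20)     # noisy targets

def phi(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Exact interpolation: centers = all inputs, Phi is N x N.
Phi = phi(np.abs(X[:, None] - X[None, :]))
w_exact = np.linalg.solve(Phi, d)             # w = Phi^{-1} d

# Fewer centers (m_1 = 5 < N): Phi is N x m_1, use the pseudoinverse.
centers = X[:5]
Phi_r = phi(np.abs(X[:, None] - centers[None, :]))
w_ls = np.linalg.pinv(Phi_r) @ d              # w = (Phi^T Phi)^{-1} Phi^T d
print(w_exact.shape, w_ls.shape)
```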

Other Perspectives on RBF Learning
RBF learning is formulated as
  F(x) = Σ_{k=1}^{m_1} w_k φ_k(x).
This expression was given without much rationale other than intuitive appeal. However, there is a way to derive this formalism from an interesting theoretical point of view, which we will see next.

Supervised Learning as an Ill-Posed Hypersurface Reconstruction Problem
- The exact interpolation approach has limitations: poor generalization. Data points being more numerous than the degrees of freedom of the underlying process can lead to overfitting.
- How to overcome this issue? Approach the problem from the perspective that learning is a hypersurface reconstruction problem given a sparse set of data points.
- Contrast between the direct problem (in many cases well-posed) and the inverse problem (in many cases ill-posed).

Well-Posed Problems in Reconstructing a Functional Mapping
Given an unknown mapping from domain X to range Y, we want to reconstruct the mapping f. This mapping is well-posed if all of the following conditions are satisfied:
- Existence: for all x ∈ X, there exists an output y ∈ Y such that y = f(x).
- Uniqueness: for all x, t ∈ X, f(x) = f(t) iff x = t.
- Continuity: the mapping is continuous; for any ε > 0 there exists δ = δ(ε) such that ρ_x(x, t) < δ implies ρ_y(f(x), f(t)) < ε, where ρ(·,·) is the distance measure.
If any of these conditions is violated, the problem is called an ill-posed problem.

Ill-Posed Problems and Solutions
- Direct (causal) mappings are generally well-posed (e.g., 3D object to 2D projection).
- Inverse problems, on the other hand, are ill-posed (e.g., reconstructing 3D structure from 2D projections).
- For ill-posed problems, solutions are not unique (in many cases there can be infinitely many): we need prior knowledge (or some kind of preference) to narrow down the range of solutions. This is called regularization.
- Treating supervised learning as an ill-posed problem and using certain prior knowledge, we can derive the RBF formalism.

Regularization Theory: Overview
- The main idea behind regularization is to stabilize the solution by means of prior information.
- This is done by including a functional (a function that maps a function to a scalar) in the cost function, so that the functional is also minimized. Only a small number of candidate solutions will minimize this functional.
- These functional terms are called the regularization term. Typically, the functionals measure the smoothness of the function.

Tikhonov's Regularization Theory
Task: given inputs x_i ∈ R^{m_0} and targets d_i ∈ R, find F(x). Minimize the sum of two terms:
- Standard error term: E_s(F) = (1/2) Σ_{i=1}^{N} (d_i - y_i)^2 = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2
- Regularization term: E_c(F) = (1/2) ||DF||^2, where D is a linear differential operator and ||·|| is the norm of the function space.
Putting these together, we want to minimize (with regularization parameter λ):
  E(F) = E_s(F) + λ E_c(F) = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2 + (λ/2) ||DF||^2

Error Term vs. Regularization Term
[Figure: three example fits, each showing the data fit and the underlying RBFs.]
- Left: bad fit, extremely smooth (regularization dominates).
- Middle: good fit, smooth.
- Right: overfit, jagged.
Try this demo: http://lcn.epfl.ch/tutorial/english/rbf/html/

Solution that Minimizes E(F)
Problem: minimize
  E(F) = E_s(F) + λ E_c(F) = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2 + (λ/2) ||DF||^2.
Solution: the F_λ(x) that satisfies the Euler-Lagrange equation below minimizes E(F):
  D~ D F_λ(x) - (1/λ) Σ_{i=1}^{N} [d_i - F(x_i)] δ(x - x_i) = 0,
where D~ is the adjoint operator of D and δ(·) is the Dirac delta function.
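
To build intuition for what the regularization term E_c(F) = (1/2)||DF||^2 measures, the sketch below uses a discrete stand-in for D (second differences on a grid) and compares a smooth curve with a jagged one. This is only an illustrative surrogate, not Tikhonov's functional itself.

```python
# Illustrative surrogate for the two terms in E(F): a data-fit term and a
# discrete stand-in for the smoothness penalty ||DF||^2 (sum of squared
# second differences of F on a grid).
import numpy as np

def data_fit_term(F_at_xi, d):
    return 0.5 * np.sum((d - F_at_xi) ** 2)

def roughness_term(F_on_grid, h):
    second_diff = np.diff(F_on_grid, n=2) / h ** 2   # discrete D = d^2/dx^2
    return 0.5 * np.sum(second_diff ** 2) * h

grid = np.linspace(0, 1, 101)
smooth = np.sin(2 * np.pi * grid)
jagged = smooth + 0.2 * np.sin(40 * np.pi * grid)
print(roughness_term(smooth, 0.01), roughness_term(jagged, 0.01))  # jagged >> smooth
```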

Solution that Minimizes E(F) (cont'd)
The solution to the Euler-Lagrange equation can be formulated in terms of the Green's function that satisfies
  D~ D G(x, x') = δ(x - x').
Note: the form of G(·,·) depends on the particular choice of D. Finally, the desired function F_λ(x) that minimizes E(F) is
  F_λ(x) = (1/λ) Σ_{i=1}^{N} [d_i - F(x_i)] G(x, x_i).

Solution that Minimizes E(F) (cont'd)
Letting w_i = (1/λ) [d_i - F(x_i)], i = 1, ..., N, we can recast F_λ(x) as
  F_λ(x) = Σ_{i=1}^{N} w_i G(x, x_i).
Plugging in input x_j, we get
  F_λ(x_j) = Σ_{i=1}^{N} w_i G(x_j, x_i).
Note the similarity to the RBF interpolant
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||).

Solution that Minimizes E(F) (cont'd)
We can use matrix notation:
  F_λ = [F_λ(x_1), F_λ(x_2), ..., F_λ(x_N)]^T
  d = [d_1, d_2, ..., d_N]^T
  G = [ G(x_1, x_1)  G(x_1, x_2)  ...  G(x_1, x_N)
        G(x_2, x_1)  G(x_2, x_2)  ...  G(x_2, x_N)
        ...
        G(x_N, x_1)  G(x_N, x_2)  ...  G(x_N, x_N) ]
  w = [w_1, w_2, ..., w_N]^T
Then we can rewrite the formulas above as
  w = (1/λ)(d - F_λ)  and  F_λ = G w.

Solution that Minimizes E(F) (cont'd)
Combining
  w = (1/λ)(d - F_λ)  and  F_λ = G w,
we can eliminate F_λ to get
  (G + λI) w = d.
From this we get the weights
  w = (G + λI)^{-1} d,
if G + λI is invertible (it needs to be positive definite, which can be ensured by a sufficiently large λ).
Note: G(x_i, x_j) = G(x_j, x_i), thus G^T = G.
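
The closed-form solution w = (G + λI)^{-1} d can be sketched directly. The Gaussian Green's function and the synthetic 1-D data below are assumptions for illustration.

```python
# Build the Green's-function matrix G (Gaussian, by assumption) over the
# training inputs and solve (G + lambda*I) w = d.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))              # N = 30 inputs
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def green_gaussian(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

lam = 0.1
G = green_gaussian(X, X)                          # symmetric N x N matrix
w = np.linalg.solve(G + lam * np.eye(len(X)), d)  # w = (G + lambda*I)^{-1} d

def F_lambda(x_new):
    """Regularized solution F_lambda(x) = sum_i w_i G(x, x_i)."""
    return green_gaussian(np.atleast_2d(x_new), X) @ w

print(F_lambda(np.array([[0.5]])))
```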

Solution that Minimizes E(F) (cont'd)
Steps to follow:
1. Determine D.
2. Find the Green's function matrix G associated with D.
3. Obtain the weights by w = (G + λI)^{-1} d.
For certain choices of D, the corresponding Green's functions are Gaussians, inverse multiquadrics, etc. Thus the whole approach is similar to RBF. (More on this later.)

Regularization Theory and RBF
In conclusion, the solution to the regularization problem is given by the expansion
  F_λ(x) = Σ_{i=1}^{N} w_i G(x, x_i),
where G(·,·) is the Green's function for the self-adjoint operator D~ D. When the stabilizer D has certain properties, the resulting Green's function G(·,·) becomes a radial basis function:
- D translationally invariant: G(x, x_i) = G(x - x_i).
- D translationally and rotationally invariant: G(x, x_i) = G(||x - x_i||), which is an RBF!

Translationally and Rotationally Invariant D
One example of a Green's function corresponding to a translationally and rotationally invariant D is the multivariate Gaussian function
  G(x, x_i) = exp( -(1 / (2 σ_i^2)) ||x - x_i||^2 ).
With this, the regularized solution becomes
  F_λ(x) = Σ_{i=1}^{N} w_i exp( -(1 / (2 σ_i^2)) ||x - x_i||^2 ).
This function is known to be a universal approximator.

Regularization Networks
[Figure: network with inputs x_1, ..., x_{m_0}, a hidden layer of Green's-function units G, weights w_1, ..., w_N, and output F(x).]
The regularized solution F_λ(x) above can be represented as a network:
- Hidden unit i computes G(x, x_i) (one hidden unit per input pattern).
- The output unit is a linear weighted sum of the hidden unit activations.
Problem: we need N hidden units for N input patterns, which can be excessive for large input sets.

Desirable Properties of Regularization Networks
- The regularization network is a universal approximator: it can approximate any multivariate continuous function arbitrarily well, given a sufficiently large number of hidden units.
- The approximation has the best-approximation property (the best coefficients will be found).
- The solution is optimal: it minimizes the cost functional E(F).

Generalized Radial-Basis Function Networks
[Figure: network with inputs x_1, ..., x_{m_0}, hidden units φ_1, ..., φ_{m_1} plus a bias unit φ_0 = 1, and output weights w_0 = b, w_1, ..., w_{m_1}.]
Regularization networks can be computationally demanding when N is huge. The generalized RBF network overcomes this problem:
  F(x) = Σ_{i=1}^{m_1} w_i φ_i(x),  where m_1 < N and φ_i(x) = G(||x - t_i||),
so that
  F(x) = Σ_{i=1}^{m_1} w_i G(||x - t_i||).

Generalized Radial-Basis Function Networks (cont'd)
To find the weights w_i, we minimize
  E(F) = Σ_{i=1}^{N} ( d_i - Σ_{j=1}^{m_1} w_j G(||x_i - t_j||) )^2 + λ ||DF||^2.
Minimization of E(F) yields
  (G^T G + λ G_0) w = G^T d,
where G is the N × m_1 matrix with entries G(x_i, t_j) and G_0 is the m_1 × m_1 matrix
  G_0 = [ G(t_1, t_1)      G(t_1, t_2)      ...  G(t_1, t_{m_1})
          G(t_2, t_1)      G(t_2, t_2)      ...  G(t_2, t_{m_1})
          ...
          G(t_{m_1}, t_1)  G(t_{m_1}, t_2)  ...  G(t_{m_1}, t_{m_1}) ].
As λ approaches 0, we get G^T G w = G^T d, so
  w = (G^T G)^{-1} G^T d = G^+ d.

Weighted Norm
When the individual elements of the input vector x come from different classes, a weighted norm can be used:
  ||x||_C^2 = (Cx)^T (Cx) = x^T C^T C x.
The approximation function can then be rewritten as
  F(x) = Σ_{i=1}^{m_1} w_i G(||x - t_i||_C).
For the Gaussian RBF, this can be interpreted as
  G(||x - t_i||_C) = exp( -(x - t_i)^T C^T C (x - t_i) ) = exp( -(1/2) (x - t_i)^T Σ^{-1} (x - t_i) ),
where t_i represents the mean vector and Σ the covariance matrix of a multivariate Gaussian function.
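
A hedged sketch of the generalized-RBF weight equation (G^T G + λ G_0) w = G^T d, with m_1 randomly chosen centers and a Gaussian G (both assumptions for illustration):

```python
# Solve (G^T G + lambda*G_0) w = G^T d for a generalized RBF network with
# m_1 = 8 centers chosen at random from N = 50 synthetic inputs.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
d = np.sin(X[:, 0])
centers = X[rng.choice(len(X), size=8, replace=False)]   # m_1 = 8 centers

def gauss(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

G = gauss(X, centers)          # N x m_1 matrix, G[i, j] = G(x_i, t_j)
G0 = gauss(centers, centers)   # m_1 x m_1 matrix, G0[i, j] = G(t_i, t_j)
lam = 1e-3
w = np.linalg.solve(G.T @ G + lam * G0, G.T @ d)
# As lambda -> 0 this tends to the pseudoinverse solution w = G^+ d.
print(w)
```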

Regularization Networks vs. Generalized RBF
[Figure: the regularization network (N hidden units G, weights w_1, ..., w_N) side by side with the generalized RBF network (m_1 hidden units φ plus bias φ_0 = 1, weights w_0 = b, w_1, ..., w_{m_1}).]
- The hidden layer in the GRBF is much smaller: m_1 < N.
- In the GRBF, (1) the weights, (2) the RBF centers t_i, and (3) the norm-weighting matrix are all unknown parameters to be determined.
- In regularization networks, the RBF centers are known (they are the inputs themselves), and only the weights need to be determined.

Estimating the Parameters
- Weights w_i: already discussed (more next).
- Regularization parameter λ: minimize the averaged squared error, or use generalized cross-validation.
- RBF centers: randomly selected fixed centers, self-organized selection, or supervised selection.

Estimating the Regularization Parameter λ
- Minimize the average squared error: for a fixed λ, over all N inputs, calculate the squared error between the true function value and the estimated RBF network output using that λ; find the optimal λ that minimizes this error. Problem: this requires knowledge of the true function values.
- Generalized cross-validation: use leave-one-out cross-validation. With a fixed λ, for each of the N inputs, find the difference between the target value (from the training set) and the value predicted by the network trained with that point left out. This approach depends only on the training set.

RBF Learning
[Figure: generalized RBF network with hidden units φ_0 = 1, φ_1, ..., φ_{m_1} and output weights w_0 = b, w_1, ..., w_{m_1}.]
The basic idea is to learn on two different time scales:
- Nonlinear, slow learning of the RBF parameters (centers, variances).
- Linear, fast learning of the hidden-to-output weights.
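
The leave-one-out selection of λ described above can be sketched as follows. The synthetic data and Gaussian Green's function are assumptions for illustration; this is a plain leave-one-out loop, not the generalized cross-validation formula itself.

```python
# Choose lambda by leave-one-out cross-validation on a regularization network:
# for each held-out point, refit on the remaining N-1 points and accumulate
# the squared prediction error on that point.
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(25, 1))
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=25)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
G = np.exp(-sq / 2.0)                        # N x N Green's-function matrix

def loo_error(G, d, lam):
    N = len(d)
    err = 0.0
    for i in range(N):
        keep = np.arange(N) != i
        Gk = G[np.ix_(keep, keep)]
        w = np.linalg.solve(Gk + lam * np.eye(N - 1), d[keep])
        err += (d[i] - G[i, keep] @ w) ** 2  # error on the held-out point
    return err / N

lams = [1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lams, key=lambda lam: loo_error(G, d, lam))
print(best_lam)
```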

RBF Learning (1/3): Random Centers
Use m_1 hidden units:
  G(||x - t_i||^2) = exp( -(m_1 / d_max^2) ||x - t_i||^2 ),
where the centers t_i (i = 1, 2, ..., m_1) are picked at random from the available inputs x_j (j = 1, 2, ..., N). Note that the standard deviation (width) of the RBF is fixed to
  σ = d_max / sqrt(2 m_1),
where d_max is the maximum distance between the chosen centers t_i. This gives a width that is neither too peaked nor too flat. The linear weights are learned using the pseudoinverse:
  w = G^+ d = (G^T G)^{-1} G^T d,
where G = {g_{ji}} with g_{ji} = G(||x_j - t_i||^2).

Finding G^+ with Singular Value Decomposition
If for a real N × M matrix G there exist orthogonal matrices
  U = [u_1, u_2, ..., u_N] and V = [v_1, v_2, ..., v_M]
such that
  U^T G V = diag(σ_1, σ_2, ..., σ_K) = Σ,  K = min(M, N),
then U is called the left singular matrix, V the right singular matrix, and σ_1, σ_2, ..., σ_K the singular values of the matrix G. Once these are known, we can obtain G^+ as
  G^+ = V Σ^+ U^T,
where Σ^+ = diag(1/σ_1, 1/σ_2, ..., 1/σ_K). There are efficient algorithms for singular value decomposition that can be used for this.

Finding G^+ with Singular Value Decomposition (cont'd)
Using the properties
  U^{-1} = U^T,  V^{-1} = V^T,  Σ Σ^+ = I,
we can verify that G G^+ = I:
  U^T G V = Σ  implies  U U^T G V V^T = U Σ V^T,  so  G = U Σ V^T,
  G G^+ = U Σ V^T V Σ^+ U^T = U Σ Σ^+ U^T = U U^T = I.

RBF Learning (2/3): Self-Organized Centers
The random-center approach is only effective with large input sets. To overcome this, we can take a hybrid approach: (1) self-organized learning of the centers, and (2) supervised learning of the linear weights. Clustering for RBF center learning (similar to Self-Organizing Maps); a small sketch of this loop follows the list:
1. Initialization: randomly choose distinct initial centers t_k(0).
2. Sampling: draw a random input vector x ∈ X.
3. Similarity matching: find the best-matching center vector t_{k(x)}: k(x) = arg min_k ||x(n) - t_k(n)||.
4. Updating: update the center vectors:
   t_k(n+1) = t_k(n) + η [x(n) - t_k(n)]  if k = k(x),
   t_k(n+1) = t_k(n)  otherwise.
5. Continuation: increment n and repeat from step 2.
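
The self-organized center-selection loop (steps 1-5 above) can be sketched as below; the data, the number of centers, and the learning rate η are illustrative choices.

```python
# Competitive (self-organized) center updates: pick a random input, find the
# nearest center, and move that center toward the input.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                            # training inputs
t = X[rng.choice(len(X), size=4, replace=False)].copy()  # initial centers
eta = 0.05

for n in range(1000):
    x = X[rng.integers(len(X))]                     # 2. sampling
    k = np.argmin(np.sum((t - x) ** 2, axis=1))     # 3. similarity matching
    t[k] += eta * (x - t[k])                        # 4. update winning center
print(t)
```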

RBF Learning (3/3): Supervised Selection of Centers
Use error-correction learning to adjust all RBF parameters so as to minimize the error cost function
  E = (1/2) Σ_{j=1}^{N} e_j^2,  e_j = d_j - F(x_j) = d_j - Σ_{i=1}^{M} w_i G(||x_j - t_i||_{C_i}).
- Linear weights (output layer): w_i(n+1) = w_i(n) - η_1 ∂E(n)/∂w_i(n)
- Position of centers (hidden layer): t_i(n+1) = t_i(n) - η_2 ∂E(n)/∂t_i(n)
- Spread of centers (hidden layer): Σ_i^{-1}(n+1) = Σ_i^{-1}(n) - η_3 ∂E(n)/∂Σ_i^{-1}(n)

RBF Learning (3/3): Supervised Selection of Centers (cont'd)
- Linear weights:
  ∂E(n)/∂w_i(n) = Σ_{j=1}^{N} e_j(n) G(||x_j - t_i(n)||_{C_i})
- Position of centers:
  ∂E(n)/∂t_i(n) = 2 w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Σ_i^{-1} [x_j - t_i(n)]
- Spread of centers:
  ∂E(n)/∂Σ_i^{-1}(n) = -w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Q_{ji}(n),  with Q_{ji}(n) = [x_j - t_i(n)][x_j - t_i(n)]^T
(A small numerical sketch of these updates appears after the summary below.)

Comparison of RBF and MLP
- RBF has a single hidden layer, while an MLP can have many.
- In an MLP, hidden and output neurons have the same underlying function; in RBF, they are specialized into distinct functions.
- In RBF, the output layer is linear; in an MLP, all neurons are nonlinear.
- The hidden neurons in RBF calculate the Euclidean norm between the input vector and the center, while in an MLP the inner product of the input vector and the weight vector is calculated.
- MLPs construct global approximations to nonlinear input-output mappings; RBF networks use exponentially decaying localized nonlinearities (e.g., Gaussians) to construct local approximations to nonlinear input-output mappings.

Summary
- The RBF network is unusual in having two different unit types: the RBF hidden layer and the linear output layer.
- RBF is derived in a principled manner, starting from Tikhonov's regularization theory, unlike MLP.
- In RBF, the smoothing term becomes important, with different choices of D giving rise to different Green's functions G(·,·).
- Generalized RBF lifts the requirement of N hidden units for N input patterns, greatly reducing the computational complexity.
- Proper estimation of the regularization parameter λ is needed.
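
To close, here is a hedged numerical sketch of the supervised center-selection idea from the "RBF Learning (3/3)" slides, simplified to a Gaussian RBF with a fixed isotropic spread σ (so the Σ_i update is omitted). Both the output weights and the centers follow the gradient of E = (1/2) Σ_j e_j^2; the data, m_1, and the learning rates are illustrative.

```python
# Joint gradient updates of the output weights w and the centers t for a
# Gaussian RBF network with fixed isotropic spread sigma (simplified sketch).
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(100, 1))
d = np.sin(X[:, 0])
m1, sigma, eta_w, eta_t = 6, 1.0, 0.01, 0.01
t = X[rng.choice(len(X), size=m1, replace=False)].copy()   # initial centers
w = np.zeros(m1)

for epoch in range(200):
    diff = X[:, None, :] - t[None, :, :]                        # N x m1 x dim
    G = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))   # N x m1
    e = d - G @ w                                               # errors e_j
    w += eta_w * (G.T @ e)                  # descent on E w.r.t. the weights
    grad_t = (w[:, None] / sigma ** 2) * np.einsum('j,ji,jid->id', e, G, diff)
    t += eta_t * grad_t                     # descent on E w.r.t. the centers
print(np.mean(e ** 2))                      # final mean squared error
```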