Slide05 Haykin Chapter 5: Radial-Basis Function Networks


CPSC 636-600, Instructor: Yoonsuck Choe, Spring.

Learning in MLP
- Supervised learning in multilayer perceptrons: a recursive technique of stochastic approximation, e.g., backprop.
- Alternative view: design of the network as a curve-fitting (approximation) problem, e.g., RBF.
- Curve-fitting: finding a surface in a multidimensional space that provides a best fit to the training data, with "best fit" measured in a certain statistical sense.
- RBF is an example: the hidden neurons form an arbitrary basis for the input patterns when they are expanded into the hidden space. These basis functions are called radial basis functions.

Radial-Basis Function Networks
[Figure: network with an input layer, a hidden layer of radial-basis units φ_1, ..., φ_4, and a linear output unit with weights W.]
Three layers:
- Input.
- Hidden: nonlinear transformation from the input space to the hidden space.
- Output: linear activation.
Principal motivation: Cover's theorem, i.e., a pattern-classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Cover's Theorem
- Cover's theorem on the separability of patterns: a complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space.
- Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input space.
- Once the proper mapping is done, simple, quick algorithms can be used to find the separating hyperplane.
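
To make the three-layer structure above concrete, here is a minimal sketch (not from the lecture) of an RBF forward pass in NumPy: a Gaussian hidden layer followed by a linear output unit. The centers, width, and weights are illustrative placeholders.

```python
# Minimal sketch of the RBF forward pass described above: a Gaussian hidden
# layer followed by a linear output layer. Centers, width, and weights are
# illustrative placeholders, not values from the lecture.
import numpy as np

def rbf_forward(x, centers, sigma, w, b=0.0):
    """Map input x through Gaussian hidden units, then a linear output unit."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return w @ phi + b

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # hidden-unit centers (assumed)
w = np.array([0.5, -0.5])                      # hidden-to-output weights (assumed)
print(rbf_forward(np.array([0.2, 0.1]), centers, sigma=1.0, w=w))
```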

φ-Separability of Patterns
- N input patterns X = {x_1, x_2, ..., x_N} in an m_0-dimensional space.
- The inputs belong to one of two sets, X_1 and X_2: they form a dichotomy.
- The dichotomy is separable with respect to a family of surfaces if a surface exists in the family that separates the points in class X_1 from those in X_2.
- For each x ∈ X, define an m_1-vector {φ_i(x) | i = 1, 2, ..., m_1}:
  φ(x) = [φ_1(x), φ_2(x), ..., φ_{m_1}(x)]^T,
  which maps inputs in the m_0-dimensional space to the m_1-dimensional hidden space.
- The φ_i(x) are called hidden functions, and the space spanned by these functions is called the hidden space or feature space.
- A dichotomy is φ-separable if there exists an m_1-dimensional vector w such that
  w^T φ(x) > 0 for x ∈ X_1, and w^T φ(x) < 0 for x ∈ X_2,
  with separating hyperplane w^T φ(x) = 0.

Cover's Theorem Revisited
Given a set X of N inputs picked from the input space independently, suppose all possible dichotomies of X are equiprobable. Let P(N, m_1) denote the probability that a particular dichotomy picked at random is φ-separable, where the family of surfaces has m_1 degrees of freedom. In this case,
  P(N, m_1) = (1/2)^{N-1} Σ_{m=0}^{m_1 - 1} C(N-1, m),
where C(l, m) = l(l-1)(l-2)···(l-m+1) / m! is the binomial coefficient.

Cover's Theorem: Interpretation
- Separability depends on (1) the particular dichotomy and (2) the distribution of patterns in the input space.
- The derived P(N, m_1) states that the probability of being φ-separable equals the cumulative binomial distribution corresponding to the probability that N-1 flips of a fair coin will result in m_1 - 1 or fewer heads.
- In sum, Cover's theorem has two basic ingredients:
  1. Nonlinear mapping to the hidden space with φ_i(x), i = 1, ..., m_1.
  2. High dimensionality of the hidden space compared to the input space (m_1 > m_0).
- Corollary: a maximum of 2 m_1 patterns can be linearly separated by an m_1-dimensional hidden space.

Example: XOR (again!)
With Gaussian hidden functions, the inputs become linearly separable in the hidden space:
  φ_1(x) = exp(-||x - t_1||^2),  t_1 = [1, 1]^T
  φ_2(x) = exp(-||x - t_2||^2),  t_2 = [0, 0]^T
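
The XOR example can be checked numerically. The sketch below (an illustration, not part of the original slides) maps the four XOR inputs through the two Gaussian hidden functions above and prints their images in (φ_1, φ_2) space, where the two classes become linearly separable.

```python
# Map the four XOR inputs through the two Gaussian hidden functions with
# centers t_1 = [1, 1] and t_2 = [0, 0], as in the example above.
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])  # XOR targets

phi = np.column_stack([
    np.exp(-np.sum((X - t1) ** 2, axis=1)),   # phi_1(x)
    np.exp(-np.sum((X - t2) ** 2, axis=1)),   # phi_2(x)
])
for p, y in zip(phi, labels):
    print(p, y)
# In (phi_1, phi_2) space the class-0 points have phi_1 + phi_2 near 1.14 and
# the class-1 points near 0.74, so a straight line separates the two classes.
```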

2D Gaussian RBF
[Figure: surface and contour plot of the 2D Gaussian exp(-(x^2 + y^2)), centered at the origin, alongside the RBF network diagram.]
Basically, t_i determines the center of the 2D Gaussian
  φ_i(x) = exp(-||x - t_i||^2),
and points x that are equidistant from the center have the same φ value.

Learning: Overview
The RBF learning problem boils down to two tasks:
1. How to determine the parameters associated with the radial-basis functions in the hidden layer, φ_i(x) (e.g., the centers of the Gaussians).
2. How to train the hidden-to-output weights? This part is relatively easy.

RBF as an Interpolation Problem
- An m_0-dimensional input to 1-dimensional output mapping s: R^{m_0} → R.
- The map s can be thought of as a hypersurface Γ ⊂ R^{m_0 + 1}.
- Training: fit the hypersurface Γ to the training data points.
- Generalization: interpolate between data points, along the reconstructed surface Γ.

RBF and Interpolation
Given {x_i ∈ R^{m_0} | i = 1, 2, ..., N} and N labels {d_i ∈ R | i = 1, 2, ..., N}, find F: R^{m_0} → R such that F(x_i) = d_i for all i. Interpolation is formulated as
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||),
where {φ(||x - x_i||) | i = 1, 2, ..., N} is a set of N arbitrary (nonlinear) functions known as radial-basis functions. The known data points x_i ∈ R^{m_0}, i = 1, 2, ..., N, are treated as the centers of the RBFs. (Note that in this case all input data need to be memorized, as in instance-based learning, but this is not a necessary requirement.)
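
As a quick numerical check of the claim that points equidistant from the center t_i receive the same φ value, here is a tiny Gaussian-RBF sketch (illustrative only):

```python
# A Gaussian RBF centered at t assigns the same value to all points at the
# same distance from t.
import numpy as np

def gaussian_rbf(x, t):
    return np.exp(-np.sum((x - t) ** 2))

t = np.array([0.0, 0.0])
# Two different points, both at distance 1 from the center t.
print(gaussian_rbf(np.array([1.0, 0.0]), t))   # exp(-1)
print(gaussian_rbf(np.array([0.0, -1.0]), t))  # exp(-1), same value
```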

RBF and Interpolation (cont'd)
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||)
So we have N inputs, N hidden units, and one output unit. Expressing everything (all N input-output pairs) in matrix form:

  [ φ_{11}  φ_{12}  ...  φ_{1N} ] [ w_1 ]   [ d_1 ]
  [ φ_{21}  φ_{22}  ...  φ_{2N} ] [ w_2 ]   [ d_2 ]
  [  ...     ...    ...   ...   ] [ ... ] = [ ... ]
  [ φ_{N1}  φ_{N2}  ...  φ_{NN} ] [ w_N ]   [ d_N ]

where φ_{ji} = φ(||x_j - x_i||), with x_i the center. We can abbreviate the above as
  Φw = d.

RBF and Interpolation (cont'd)
From Φw = d we can find an explicit solution
  w = Φ^{-1} d,
assuming Φ is nonsingular. (Note: in general, the number of hidden units is much less than the number of inputs, so we don't always have Φ as a square matrix! We'll see how to handle this later.)
Nonsingularity of Φ is guaranteed by Micchelli's theorem: let {x_i}_{i=1}^{N} be a set of distinct points in R^{m_0}. Then the N-by-N interpolation matrix Φ, whose ji-th element is φ_{ji} = φ(||x_j - x_i||), is nonsingular.

RBF and Interpolation (cont'd)
When m_1 < N (m_1: number of hidden units; N: number of inputs), we can find the w that minimizes
  E(w) = Σ_{i=1}^{N} (F(x_i) - d_i)^2,  where F(x) = Σ_{k=1}^{m_1} w_k φ_k(x).
The solution involves the pseudoinverse of Φ:
  w = (Φ^T Φ)^{-1} Φ^T d.
Note: Φ is now an N × m_1 rectangular matrix. In this case, how to determine the centers of the φ_k(·) functions becomes an issue.

Typical RBFs
For some c > 0, σ > 0, and r ∈ R:
- Multiquadrics (non-local): φ(r) = (r^2 + c^2)^{1/2}
- Inverse multiquadrics (local): φ(r) = (r^2 + c^2)^{-1/2}
- Gaussian functions (local): φ(r) = exp(-r^2 / (2σ^2))
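
Both cases above, exact interpolation via Φw = d and the overdetermined case m_1 < N solved through the pseudoinverse, can be sketched in a few lines of NumPy. The Gaussian RBF, the synthetic 1-D data, and the choice of m_1 = 5 centers are assumptions made for illustration.

```python
# Exact interpolation (N centers, solve Phi w = d) versus the m_1 < N case
# (solve via the pseudoinverse), using a Gaussian RBF on made-up 1-D data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)               # N = 20 training inputs
d = np.sin(X) + 0.1 * rng.normal(size=20)     # noisy targets

def phi(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Exact interpolation: centers = all inputs, Phi is N x N.
Phi = phi(np.abs(X[:, None] - X[None, :]))
w_exact = np.linalg.solve(Phi, d)             # w = Phi^{-1} d

# Fewer centers (m_1 = 5 < N): Phi is N x m_1, use the pseudoinverse.
centers = X[:5]
Phi_r = phi(np.abs(X[:, None] - centers[None, :]))
w_ls = np.linalg.pinv(Phi_r) @ d              # w = (Phi^T Phi)^{-1} Phi^T d
print(w_exact.shape, w_ls.shape)
```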

Other Perspectives on RBF Learning
RBF learning is formulated as
  F(x) = Σ_{k=1}^{m_1} w_k φ_k(x).
This expression was given without much rationale other than intuitive appeal. However, there is a way to derive this formalism from an interesting theoretical point of view, which we will see next.

Supervised Learning as an Ill-Posed Hypersurface Reconstruction Problem
- The exact interpolation approach has limitations: poor generalization. Data points being more numerous than the degrees of freedom of the underlying process can lead to overfitting.
- How to overcome this issue? Approach the problem from the perspective that learning is a hypersurface reconstruction problem given a sparse set of data points.
- Contrast between the direct problem (in many cases well-posed) and the inverse problem (in many cases ill-posed).

Well-Posed Problems in Reconstructing a Functional Mapping
Given an unknown mapping from domain X to range Y, we want to reconstruct the mapping f. This mapping is well-posed if all of the following conditions are satisfied:
- Existence: for all x ∈ X, there exists an output y ∈ Y such that y = f(x).
- Uniqueness: for all x, t ∈ X, f(x) = f(t) iff x = t.
- Continuity: the mapping is continuous; for any ε > 0 there exists δ = δ(ε) such that ρ_x(x, t) < δ implies ρ_y(f(x), f(t)) < ε, where ρ(·,·) is the distance measure.
If any of these conditions is violated, the problem is called an ill-posed problem.

Ill-Posed Problems and Solutions
- Direct (causal) mappings are generally well-posed (e.g., 3D object to 2D projection).
- Inverse problems, on the other hand, are ill-posed (e.g., reconstructing 3D structure from 2D projections).
- For ill-posed problems, solutions are not unique (in many cases there can be infinitely many): we need prior knowledge (or some kind of preference) to narrow down the range of solutions. This is called regularization.
- Treating supervised learning as an ill-posed problem and using certain prior knowledge, we can derive the RBF formalism.

Regularization Theory: Overview
- The main idea behind regularization is to stabilize the solution by means of prior information.
- This is done by including a functional (a function that maps a function to a scalar) in the cost function, so that the functional is also minimized. Only a small number of candidate solutions will minimize this functional.
- These functional terms are called the regularization term. Typically, the functionals measure the smoothness of the function.

Tikhonov's Regularization Theory
Task: given inputs x_i ∈ R^{m_0} and targets d_i ∈ R, find F(x). Minimize the sum of two terms:
- Standard error term: E_s(F) = (1/2) Σ_{i=1}^{N} (d_i - y_i)^2 = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2
- Regularization term: E_c(F) = (1/2) ||DF||^2, where D is a linear differential operator and ||·|| is the norm of the function space.
Putting these together, we want to minimize (with regularization parameter λ):
  E(F) = E_s(F) + λ E_c(F) = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2 + (λ/2) ||DF||^2

Error Term vs. Regularization Term
[Figure: three example fits, each showing the data fit and the underlying RBFs.]
- Left: bad fit, extremely smooth (regularization dominates).
- Middle: good fit, smooth.
- Right: overfit, jagged.
Try this demo: http://lcn.epfl.ch/tutorial/english/rbf/html/

Solution that Minimizes E(F)
Problem: minimize
  E(F) = E_s(F) + λ E_c(F) = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2 + (λ/2) ||DF||^2.
Solution: the F_λ(x) that satisfies the Euler-Lagrange equation below minimizes E(F):
  D~ D F_λ(x) - (1/λ) Σ_{i=1}^{N} [d_i - F(x_i)] δ(x - x_i) = 0,
where D~ is the adjoint operator of D and δ(·) is the Dirac delta function.
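
To build intuition for what the regularization term E_c(F) = (1/2)||DF||^2 measures, the sketch below uses a discrete stand-in for D (second differences on a grid) and compares a smooth curve with a jagged one. This is only an illustrative surrogate, not Tikhonov's functional itself.

```python
# Illustrative surrogate for the two terms in E(F): a data-fit term and a
# discrete stand-in for the smoothness penalty ||DF||^2 (sum of squared
# second differences of F on a grid).
import numpy as np

def data_fit_term(F_at_xi, d):
    return 0.5 * np.sum((d - F_at_xi) ** 2)

def roughness_term(F_on_grid, h):
    second_diff = np.diff(F_on_grid, n=2) / h ** 2   # discrete D = d^2/dx^2
    return 0.5 * np.sum(second_diff ** 2) * h

grid = np.linspace(0, 1, 101)
smooth = np.sin(2 * np.pi * grid)
jagged = smooth + 0.2 * np.sin(40 * np.pi * grid)
print(roughness_term(smooth, 0.01), roughness_term(jagged, 0.01))  # jagged >> smooth
```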

Solution that Minimizes E(F) (cont'd)
The solution to the Euler-Lagrange equation can be formulated in terms of the Green's function that satisfies
  D~ D G(x, x') = δ(x - x').
Note: the form of G(·,·) depends on the particular choice of D. Finally, the desired function F_λ(x) that minimizes E(F) is
  F_λ(x) = (1/λ) Σ_{i=1}^{N} [d_i - F(x_i)] G(x, x_i).

Solution that Minimizes E(F) (cont'd)
Letting w_i = (1/λ) [d_i - F(x_i)], i = 1, ..., N, we can recast F_λ(x) as
  F_λ(x) = Σ_{i=1}^{N} w_i G(x, x_i).
Plugging in input x_j, we get
  F_λ(x_j) = Σ_{i=1}^{N} w_i G(x_j, x_i).
Note the similarity to the RBF interpolant
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||).

Solution that Minimizes E(F) (cont'd)
We can use matrix notation:
  F_λ = [F_λ(x_1), F_λ(x_2), ..., F_λ(x_N)]^T
  d = [d_1, d_2, ..., d_N]^T
  G = [ G(x_1, x_1)  G(x_1, x_2)  ...  G(x_1, x_N)
        G(x_2, x_1)  G(x_2, x_2)  ...  G(x_2, x_N)
        ...
        G(x_N, x_1)  G(x_N, x_2)  ...  G(x_N, x_N) ]
  w = [w_1, w_2, ..., w_N]^T
Then we can rewrite the formulas above as
  w = (1/λ)(d - F_λ)  and  F_λ = G w.

Solution that Minimizes E(F) (cont'd)
Combining
  w = (1/λ)(d - F_λ)  and  F_λ = G w,
we can eliminate F_λ to get
  (G + λI) w = d.
From this we get the weights
  w = (G + λI)^{-1} d,
if G + λI is invertible (it needs to be positive definite, which can be ensured by a sufficiently large λ).
Note: G(x_i, x_j) = G(x_j, x_i), thus G^T = G.
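
The closed-form solution w = (G + λI)^{-1} d can be sketched directly. The Gaussian Green's function and the synthetic 1-D data below are assumptions for illustration.

```python
# Build the Green's-function matrix G (Gaussian, by assumption) over the
# training inputs and solve (G + lambda*I) w = d.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))              # N = 30 inputs
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def green_gaussian(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

lam = 0.1
G = green_gaussian(X, X)                          # symmetric N x N matrix
w = np.linalg.solve(G + lam * np.eye(len(X)), d)  # w = (G + lambda*I)^{-1} d

def F_lambda(x_new):
    """Regularized solution F_lambda(x) = sum_i w_i G(x, x_i)."""
    return green_gaussian(np.atleast_2d(x_new), X) @ w

print(F_lambda(np.array([[0.5]])))
```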

Solution that Minimizes E(F) (cont'd)
Steps to follow:
1. Determine D.
2. Find the Green's function matrix G associated with D.
3. Obtain the weights by w = (G + λI)^{-1} d.
For certain choices of D, the corresponding Green's functions are Gaussians, inverse multiquadrics, etc. Thus the whole approach is similar to RBF. (More on this later.)

Regularization Theory and RBF
In conclusion, the solution to the regularization problem is given by the expansion
  F_λ(x) = Σ_{i=1}^{N} w_i G(x, x_i),
where G(·,·) is the Green's function for the self-adjoint operator D~ D. When the stabilizer D has certain properties, the resulting Green's function G(·,·) becomes a radial basis function:
- D translationally invariant: G(x, x_i) = G(x - x_i).
- D translationally and rotationally invariant: G(x, x_i) = G(||x - x_i||), which is an RBF!

Translationally and Rotationally Invariant D
One example of a Green's function corresponding to a translationally and rotationally invariant D is the multivariate Gaussian function
  G(x, x_i) = exp( -(1 / (2 σ_i^2)) ||x - x_i||^2 ).
With this, the regularized solution becomes
  F_λ(x) = Σ_{i=1}^{N} w_i exp( -(1 / (2 σ_i^2)) ||x - x_i||^2 ).
This function is known to be a universal approximator.

Regularization Networks
[Figure: network with inputs x_1, ..., x_{m_0}, a hidden layer of Green's-function units G, weights w_1, ..., w_N, and output F(x).]
The regularized solution F_λ(x) above can be represented as a network:
- Hidden unit i computes G(x, x_i) (one hidden unit per input pattern).
- The output unit is a linear weighted sum of the hidden unit activations.
Problem: we need N hidden units for N input patterns, which can be excessive for large input sets.

Desirable Properties of Regularization Networks
- The regularization network is a universal approximator: it can approximate any multivariate continuous function arbitrarily well, given a sufficiently large number of hidden units.
- The approximation has the best-approximation property (the best coefficients will be found).
- The solution is optimal: it minimizes the cost functional E(F).

Generalized Radial-Basis Function Networks
[Figure: network with inputs x_1, ..., x_{m_0}, hidden units φ_1, ..., φ_{m_1} plus a bias unit φ_0 = 1, and output weights w_0 = b, w_1, ..., w_{m_1}.]
Regularization networks can be computationally demanding when N is huge. The generalized RBF network overcomes this problem:
  F(x) = Σ_{i=1}^{m_1} w_i φ_i(x),  where m_1 < N and φ_i(x) = G(||x - t_i||),
so that
  F(x) = Σ_{i=1}^{m_1} w_i G(||x - t_i||).

Generalized Radial-Basis Function Networks (cont'd)
To find the weights w_i, we minimize
  E(F) = Σ_{i=1}^{N} ( d_i - Σ_{j=1}^{m_1} w_j G(||x_i - t_j||) )^2 + λ ||DF||^2.
Minimization of E(F) yields
  (G^T G + λ G_0) w = G^T d,
where G is the N × m_1 matrix with entries G(x_i, t_j) and G_0 is the m_1 × m_1 matrix
  G_0 = [ G(t_1, t_1)      G(t_1, t_2)      ...  G(t_1, t_{m_1})
          G(t_2, t_1)      G(t_2, t_2)      ...  G(t_2, t_{m_1})
          ...
          G(t_{m_1}, t_1)  G(t_{m_1}, t_2)  ...  G(t_{m_1}, t_{m_1}) ].
As λ approaches 0, we get G^T G w = G^T d, so
  w = (G^T G)^{-1} G^T d = G^+ d.

Weighted Norm
When the individual elements of the input vector x come from different classes, a weighted norm can be used:
  ||x||_C^2 = (Cx)^T (Cx) = x^T C^T C x.
The approximation function can then be rewritten as
  F(x) = Σ_{i=1}^{m_1} w_i G(||x - t_i||_C).
For the Gaussian RBF, this can be interpreted as
  G(||x - t_i||_C) = exp( -(x - t_i)^T C^T C (x - t_i) ) = exp( -(1/2) (x - t_i)^T Σ^{-1} (x - t_i) ),
where t_i represents the mean vector and Σ the covariance matrix of a multivariate Gaussian function.
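
A hedged sketch of the generalized-RBF weight equation (G^T G + λ G_0) w = G^T d, with m_1 randomly chosen centers and a Gaussian G (both assumptions for illustration):

```python
# Solve (G^T G + lambda*G_0) w = G^T d for a generalized RBF network with
# m_1 = 8 centers chosen at random from N = 50 synthetic inputs.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
d = np.sin(X[:, 0])
centers = X[rng.choice(len(X), size=8, replace=False)]   # m_1 = 8 centers

def gauss(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

G = gauss(X, centers)          # N x m_1 matrix, G[i, j] = G(x_i, t_j)
G0 = gauss(centers, centers)   # m_1 x m_1 matrix, G0[i, j] = G(t_i, t_j)
lam = 1e-3
w = np.linalg.solve(G.T @ G + lam * G0, G.T @ d)
# As lambda -> 0 this tends to the pseudoinverse solution w = G^+ d.
print(w)
```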

Regularization Networks vs. Generalized RBF
[Figure: the regularization network (N hidden units G, weights w_1, ..., w_N) side by side with the generalized RBF network (m_1 hidden units φ plus bias φ_0 = 1, weights w_0 = b, w_1, ..., w_{m_1}).]
- The hidden layer in the GRBF is much smaller: m_1 < N.
- In the GRBF, (1) the weights, (2) the RBF centers t_i, and (3) the norm-weighting matrix are all unknown parameters to be determined.
- In regularization networks, the RBF centers are known (they are the inputs themselves), and only the weights need to be determined.

Estimating the Parameters
- Weights w_i: already discussed (more next).
- Regularization parameter λ: minimize the averaged squared error, or use generalized cross-validation.
- RBF centers: randomly selected fixed centers, self-organized selection, or supervised selection.

Estimating the Regularization Parameter λ
- Minimize the average squared error: for a fixed λ, over all N inputs, calculate the squared error between the true function value and the estimated RBF network output using that λ; find the optimal λ that minimizes this error. Problem: this requires knowledge of the true function values.
- Generalized cross-validation: use leave-one-out cross-validation. With a fixed λ, for each of the N inputs, find the difference between the target value (from the training set) and the value predicted by the network trained with that point left out. This approach depends only on the training set.

RBF Learning
[Figure: generalized RBF network with hidden units φ_0 = 1, φ_1, ..., φ_{m_1} and output weights w_0 = b, w_1, ..., w_{m_1}.]
The basic idea is to learn on two different time scales:
- Nonlinear, slow learning of the RBF parameters (centers, variances).
- Linear, fast learning of the hidden-to-output weights.
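
The leave-one-out selection of λ described above can be sketched as follows. The synthetic data and Gaussian Green's function are assumptions for illustration; this is a plain leave-one-out loop, not the generalized cross-validation formula itself.

```python
# Choose lambda by leave-one-out cross-validation on a regularization network:
# for each held-out point, refit on the remaining N-1 points and accumulate
# the squared prediction error on that point.
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(25, 1))
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=25)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
G = np.exp(-sq / 2.0)                        # N x N Green's-function matrix

def loo_error(G, d, lam):
    N = len(d)
    err = 0.0
    for i in range(N):
        keep = np.arange(N) != i
        Gk = G[np.ix_(keep, keep)]
        w = np.linalg.solve(Gk + lam * np.eye(N - 1), d[keep])
        err += (d[i] - G[i, keep] @ w) ** 2  # error on the held-out point
    return err / N

lams = [1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lams, key=lambda lam: loo_error(G, d, lam))
print(best_lam)
```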

RBF Learning (1/3): Random Centers
Use m_1 hidden units:
  G(||x - t_i||^2) = exp( -(m_1 / d_max^2) ||x - t_i||^2 ),
where the centers t_i (i = 1, 2, ..., m_1) are picked at random from the available inputs x_j (j = 1, 2, ..., N). Note that the standard deviation (width) of the RBF is fixed to
  σ = d_max / sqrt(2 m_1),
where d_max is the maximum distance between the chosen centers t_i. This gives a width that is neither too peaked nor too flat. The linear weights are learned using the pseudoinverse:
  w = G^+ d = (G^T G)^{-1} G^T d,
where G = {g_{ji}} with g_{ji} = G(||x_j - t_i||^2).

Finding G^+ with Singular Value Decomposition
If for a real N × M matrix G there exist orthogonal matrices
  U = [u_1, u_2, ..., u_N] and V = [v_1, v_2, ..., v_M]
such that
  U^T G V = diag(σ_1, σ_2, ..., σ_K) = Σ,  K = min(M, N),
then U is called the left singular matrix, V the right singular matrix, and σ_1, σ_2, ..., σ_K the singular values of the matrix G. Once these are known, we can obtain G^+ as
  G^+ = V Σ^+ U^T,
where Σ^+ = diag(1/σ_1, 1/σ_2, ..., 1/σ_K). There are efficient algorithms for singular value decomposition that can be used for this.

Finding G^+ with Singular Value Decomposition (cont'd)
Using the properties
  U^{-1} = U^T,  V^{-1} = V^T,  Σ Σ^+ = I,
we can verify that G G^+ = I:
  U^T G V = Σ  implies  U U^T G V V^T = U Σ V^T,  so  G = U Σ V^T,
  G G^+ = U Σ V^T V Σ^+ U^T = U Σ Σ^+ U^T = U U^T = I.

RBF Learning (2/3): Self-Organized Centers
The random-center approach is only effective with large input sets. To overcome this, we can take a hybrid approach: (1) self-organized learning of the centers, and (2) supervised learning of the linear weights. Clustering for RBF center learning (similar to Self-Organizing Maps); a small sketch of this loop follows the list:
1. Initialization: randomly choose distinct initial centers t_k(0).
2. Sampling: draw a random input vector x ∈ X.
3. Similarity matching: find the best-matching center vector t_{k(x)}: k(x) = arg min_k ||x(n) - t_k(n)||.
4. Updating: update the center vectors:
   t_k(n+1) = t_k(n) + η [x(n) - t_k(n)]  if k = k(x),
   t_k(n+1) = t_k(n)  otherwise.
5. Continuation: increment n and repeat from step 2.
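
The self-organized center-selection loop (steps 1-5 above) can be sketched as below; the data, the number of centers, and the learning rate η are illustrative choices.

```python
# Competitive (self-organized) center updates: pick a random input, find the
# nearest center, and move that center toward the input.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                            # training inputs
t = X[rng.choice(len(X), size=4, replace=False)].copy()  # initial centers
eta = 0.05

for n in range(1000):
    x = X[rng.integers(len(X))]                     # 2. sampling
    k = np.argmin(np.sum((t - x) ** 2, axis=1))     # 3. similarity matching
    t[k] += eta * (x - t[k])                        # 4. update winning center
print(t)
```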

RBF Learning (3/3): Supervised Selection of Centers
Use error-correction learning to adjust all RBF parameters so as to minimize the error cost function
  E = (1/2) Σ_{j=1}^{N} e_j^2,  e_j = d_j - F(x_j) = d_j - Σ_{i=1}^{M} w_i G(||x_j - t_i||_{C_i}).
- Linear weights (output layer): w_i(n+1) = w_i(n) - η_1 ∂E(n)/∂w_i(n)
- Position of centers (hidden layer): t_i(n+1) = t_i(n) - η_2 ∂E(n)/∂t_i(n)
- Spread of centers (hidden layer): Σ_i^{-1}(n+1) = Σ_i^{-1}(n) - η_3 ∂E(n)/∂Σ_i^{-1}(n)

RBF Learning (3/3): Supervised Selection of Centers (cont'd)
- Linear weights:
  ∂E(n)/∂w_i(n) = Σ_{j=1}^{N} e_j(n) G(||x_j - t_i(n)||_{C_i})
- Position of centers:
  ∂E(n)/∂t_i(n) = 2 w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Σ_i^{-1} [x_j - t_i(n)]
- Spread of centers:
  ∂E(n)/∂Σ_i^{-1}(n) = -w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Q_{ji}(n),  with Q_{ji}(n) = [x_j - t_i(n)][x_j - t_i(n)]^T
(A small numerical sketch of these updates appears after the summary below.)

Comparison of RBF and MLP
- RBF has a single hidden layer, while an MLP can have many.
- In an MLP, hidden and output neurons have the same underlying function; in RBF, they are specialized into distinct functions.
- In RBF, the output layer is linear; in an MLP, all neurons are nonlinear.
- The hidden neurons in RBF calculate the Euclidean norm between the input vector and the center, while in an MLP the inner product of the input vector and the weight vector is calculated.
- MLPs construct global approximations to nonlinear input-output mappings; RBF networks use exponentially decaying localized nonlinearities (e.g., Gaussians) to construct local approximations to nonlinear input-output mappings.

Summary
- The RBF network is unusual in having two different unit types: the RBF hidden layer and the linear output layer.
- RBF is derived in a principled manner, starting from Tikhonov's regularization theory, unlike MLP.
- In RBF, the smoothing term becomes important, with different choices of D giving rise to different Green's functions G(·,·).
- Generalized RBF lifts the requirement of N hidden units for N input patterns, greatly reducing the computational complexity.
- Proper estimation of the regularization parameter λ is needed.
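
To close, here is a hedged numerical sketch of the supervised center-selection idea from the "RBF Learning (3/3)" slides, simplified to a Gaussian RBF with a fixed isotropic spread σ (so the Σ_i update is omitted). Both the output weights and the centers follow the gradient of E = (1/2) Σ_j e_j^2; the data, m_1, and the learning rates are illustrative.

```python
# Joint gradient updates of the output weights w and the centers t for a
# Gaussian RBF network with fixed isotropic spread sigma (simplified sketch).
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(100, 1))
d = np.sin(X[:, 0])
m1, sigma, eta_w, eta_t = 6, 1.0, 0.01, 0.01
t = X[rng.choice(len(X), size=m1, replace=False)].copy()   # initial centers
w = np.zeros(m1)

for epoch in range(200):
    diff = X[:, None, :] - t[None, :, :]                        # N x m1 x dim
    G = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))   # N x m1
    e = d - G @ w                                               # errors e_j
    w += eta_w * (G.T @ e)                  # descent on E w.r.t. the weights
    grad_t = (w[:, None] / sigma ** 2) * np.einsum('j,ji,jid->id', e, G, diff)
    t += eta_t * grad_t                     # descent on E w.r.t. the centers
print(np.mean(e ** 2))                      # final mean squared error
```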