Introduction to Neural Networks: Structure and Training
Professor Q.J. Zhang
Department of Electronics, Carleton University, Ottawa, Canada
www.doe.carleton.ca/~qjz | qjz@doe.carleton.ca
A Quick Illustration Example: Neural Network Model for Delay Estimation in a High-Speed Interconnect Network
High-Speed VLSI Interconnect Network
[Figure: printed circuit board interconnect network linking Drivers 1-3 to Receivers 1-4]
Circuit Representation of the Interconnect Network
[Figure: equivalent circuit with source (Rs, Vp, Tr), interconnect lines of lengths L1-L4 with delays τ1-τ4, and receiver terminations R1-R4, C1-C4]
Massive Analysis of Signal Delay
Need for a Neural Network Model
- A PCB contains a large number of interconnect networks, each with different interconnect lengths, terminations, and topologies, leading to the need for massive analysis of interconnect networks.
- During PCB design/optimization, the interconnect networks must be adjusted in terms of interconnect lengths, receiver-pin load characteristics, etc., leading to the need for repetitive analysis of interconnect networks.
- This necessitates fast and accurate interconnect network models, and a neural network model is a good candidate.
Neural Network Model for Delay Analysis
[Figure: neural network with inputs L1-L4, R1-R4, C1-C4, Rs, Vp, Tr; hidden neurons; outputs τ1-τ4]
3-Layer MLP: Feedforward Computation
[Figure: inputs x_1, x_2, x_3 feed hidden neurons z_1 ... z_4, which feed outputs y_1, y_2]
Hidden neuron values: $z_k = \tanh\left(\sum_i w_{ki} x_i\right)$
Outputs: $y_j = \sum_k w_{jk} z_k$
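As a minimal sketch of this feedforward computation in NumPy (the layer sizes match the figure; the weight values are illustrative, and bias terms are omitted as in the equations above):

```python
import numpy as np

def mlp3_forward(x, W_hidden, W_out):
    """3-layer MLP feedforward: inputs -> tanh hidden layer -> linear outputs."""
    z = np.tanh(W_hidden @ x)  # hidden values z_k = tanh(sum_i w_ki * x_i)
    return W_out @ z           # outputs y_j = sum_k w_jk * z_k

# 3 inputs, 4 hidden neurons, 2 outputs, as in the figure
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))
W_out = rng.normal(size=(2, 4))
print(mlp3_forward(np.array([0.5, -1.0, 2.0]), W_hidden, W_out))
```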
Neural Net Training Objective
Training data (by simulation/measurement): $d = d(x)$
Neural network: $y = y(x)$
Objective: adjust the weights $W, V$ so as to minimize $\sum_x \left(y(x) - d(x)\right)^2$
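A minimal sketch of this objective on a toy network, using a finite-difference gradient purely for illustration (practical training computes the exact gradient by backpropagation):

```python
import numpy as np

def forward(w, x):
    # Tiny 1-input, 2-hidden, 1-output MLP; w packs all 4 weights.
    z = np.tanh(w[0:2] * x)
    return w[2:4] @ z

def sse_loss(w, xs, ds):
    # Training objective: sum over samples of (y(x) - d(x))^2
    return sum((forward(w, x) - d) ** 2 for x, d in zip(xs, ds))

xs = np.linspace(-1, 1, 20)
ds = np.sin(2 * xs)                      # hypothetical training data d(x)
w = np.random.default_rng(1).normal(size=4)
for _ in range(500):                     # simple gradient descent on the loss
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w); dw[i] = 1e-6
        g[i] = (sse_loss(w + dw, xs, ds) - sse_loss(w - dw, xs, ds)) / 2e-6
    w -= 0.01 * g
print(sse_loss(w, xs, ds))               # loss decreases as the weights are adjusted
```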
Simulation Time for 20,000 Interconnect Configurations

    Method                      CPU
    Circuit Simulator (NILT)    34.43 hours
    AWE                          9.56 hours
    Neural Network Approach      6.67 minutes
Important Features of Neural Networks
- Neural networks have the ability to model multidimensional nonlinear relationships.
- Neural models are simple, and the model computation is fast.
- Neural networks can learn and generalize from available data, making model development possible even when component formulas are unavailable.
- The neural network approach is generic, i.e., the same modeling technique can be re-used for passive/active devices/circuits.
- It is easier to update neural models whenever device or component technology changes.
Inspiration
[Figure: everyday pattern-recognition tasks — recognizing people (Mary, Lisa, John) and spoken commands (Stop, Start, Help)]
A Biological Neuron
[Figure from L.H. Tsoukalas and R.E. Uhrig, Fuzzy and Neural Approaches in Engineering, Wiley, 1997]
Neural Network Structures
Neural Network Structures
- A neural network contains:
  - neurons (processing elements)
  - connections (links between neurons)
- A neural network structure defines:
  - how information is processed inside a neuron
  - how the neurons are connected
- Examples of neural network structures:
  - multilayer perceptrons (MLP)
  - radial basis function (RBF) networks
  - wavelet networks
  - recurrent neural networks
  - knowledge-based neural networks
- MLP is the basic and most frequently used structure.
MLP Structure
[Figure: layered network — inputs x_1 ... x_n feed Layer 1 (input, N_1 neurons); hidden Layers 2 through L-1; Layer L (output, N_L neurons) produces y_1 ... y_m]
Information Processing in a Neuron
[Figure: neuron i in layer l receives z_0^{l-1} ... z_{N_{l-1}}^{l-1} through weights w_{i0}^l ... w_{iN_{l-1}}^l, sums them, and applies the activation function σ(·)]
$\gamma_i^l = \sum_j w_{ij}^l z_j^{l-1}, \qquad z_i^l = \sigma(\gamma_i^l)$
Neuron Activation Functions
- Input layer neurons simply relay the external inputs to the neural network.
- Hidden layer neurons have smooth switch-type activation functions.
- Output layer neurons can have simple linear activation functions.
Forms of Activation Functions: $z = \sigma(\gamma)$
[Plots: sigmoid, arctangent, and hyperbolic tangent activation functions]
Forms of Activation Functions: $z = \sigma(\gamma)$
Sigmoid function: $z = \sigma(\gamma) = \dfrac{1}{1 + e^{-\gamma}}$
Arctangent function: $z = \sigma(\gamma) = \dfrac{2}{\pi} \arctan(\gamma)$
Hyperbolic tangent function: $z = \sigma(\gamma) = \dfrac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma}}$
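These three functions translate directly into code; a small NumPy sketch:

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def arctangent(g):
    return (2.0 / np.pi) * np.arctan(g)

def hyperbolic_tangent(g):
    return np.tanh(g)  # same as (e^g - e^-g) / (e^g + e^-g)

g = np.linspace(-5.0, 5.0, 5)
print(sigmoid(g), arctangent(g), hyperbolic_tangent(g), sep="\n")
```

All three are smooth and monotonic and saturate at their extremes, which is the switch-type behavior hidden neurons need.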
Structure: Multilayer Perceptrons (MLP)
[Figure: the layered MLP structure — inputs x_1 ... x_n into Layer 1 (input), hidden Layers 2 ... L-1, and Layer L (output) producing y_1 ... y_m]
where:
$L$ = total number of layers
$N_l$ = number of neurons in layer $l$, $l = 1, 2, 3, \ldots, L$
$w_{ij}^l$ = link (weight) between neuron $i$ in layer $l$ and neuron $j$ in layer $l-1$
NN inputs: $x_1, x_2, \ldots, x_n$ (where $n = N_1$)
NN outputs: $y_1, y_2, \ldots, y_m$ (where $m = N_L$)
Let the neuron output value be represented by $z$: $z_i^l$ = output of neuron $i$ in layer $l$.
Each neuron has an activation function: $z = \sigma(\gamma)$
Neural Network Feedforward
Problem statement: given $x = [x_1, x_2, \ldots, x_n]^T$, compute $y = [y_1, y_2, \ldots, y_m]^T$ from the NN.
Solution: feed $x$ to layer 1, and feed the outputs of layer $l-1$ to layer $l$:
For $l = 1$: $z_i^1 = x_i, \quad i = 1, 2, \ldots, n, \quad n = N_1$
For $l = 2, 3, \ldots, L$: $\gamma_i^l = \sum_{j=0}^{N_{l-1}} w_{ij}^l z_j^{l-1}, \qquad z_i^l = \sigma(\gamma_i^l)$
and the solution is $y_i = z_i^L, \quad i = 1, 2, \ldots, m, \quad m = N_L$
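A sketch of this algorithm in NumPy, assuming the $j = 0$ term acts as a bias weight with $z_0^{l-1} = 1$ and that the output layer is linear (consistent with the activation-function slide); the weight values are illustrative:

```python
import numpy as np

def feedforward(x, weights, sigma=np.tanh):
    """Generic MLP feedforward. `weights` holds one matrix per layer
    l = 2..L; column 0 of each matrix is the bias weight w_i0."""
    z = np.asarray(x, dtype=float)           # layer 1 relays the inputs
    for l, W in enumerate(weights):
        z_aug = np.concatenate(([1.0], z))   # prepend z_0 = 1 for the bias term
        gamma = W @ z_aug                    # gamma_i = sum_j w_ij * z_j
        z = gamma if l == len(weights) - 1 else sigma(gamma)  # linear output layer
    return z

# Example: 3 inputs -> 4 hidden -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]
print(feedforward([0.1, 0.2, 0.3], weights))
```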
Question: How can an NN represent an arbitrary nonlinear input-output relationship?
Summary in plain words: given enough hidden neurons, a 3-layer perceptron can approximate an arbitrary continuous multidimensional function to any required accuracy.
Theorem (Cybenko, 1989): Let $\sigma(\cdot)$ be any continuous sigmoid function. Then finite sums of the form
$y_k = f_k(x) = \sum_{j=1}^{N_2} w_{kj}^{(3)} \, \sigma\!\left(\sum_{i=0}^{n} w_{ji}^{(2)} x_i\right), \quad k = 1, 2, \ldots, m$
are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $\hat{f}(x)$ of the above form for which
$\left|\hat{f}(x) - f(x)\right| < \varepsilon \quad \text{for all } x \in I_n$
where $I_n$ is the $n$-dimensional unit cube ($x_i \in [0, 1]$, $i = 1, 2, \ldots, n$) and $C(I_n)$ is the space of continuous functions on $I_n$.
E.g., if the original problem is $y = f(x)$ with $f \in C(I_n)$, then the form $\hat{f}(x)$ is a 3-layer perceptron NN.
Illustration of the Effect of Neural Network Weights
[Plots: the standard sigmoid function $1/(1 + e^{-x})$, and a hidden neuron response $1/(1 + e^{-(0.5x - 2.5)})$ with input weight 0.5 and bias -2.5]
Suppose this neuron is $z_i^l$. The weights $w_{ij}^l$ (or $w_{ki}^{l+1}$) affect the figure horizontally (or vertically). The values of the $w$'s are not unique; e.g., the signs of the 3 values in the example above can be changed.
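A small numeric sketch of this effect (0.5 and -2.5 are the slide's input weight and bias; the output-side weight of 2.0 is an assumed value for illustration):

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

x = np.linspace(-10, 20, 7)
standard = sigmoid(x)            # 1 / (1 + e^-x)
hidden = sigmoid(0.5 * x - 2.5)  # input weight 0.5, bias -2.5: horizontal shift/stretch
scaled = 2.0 * hidden            # assumed output weight 2.0: vertical scaling
print(np.round(np.c_[x, standard, hidden, scaled], 3))
```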
Question: How many hidden neurons are needed?
Essence: it depends on the degree of nonlinearity of the original problem. Highly nonlinear problems need more neurons; smoother problems need fewer neurons.
- Too many neurons may lead to over-learning.
- Too few neurons will not represent the problem well enough.
Solutions: experience, trial and error, adaptive schemes.
Question: How many layers are needed?
- 3 or more layers are necessary and sufficient for arbitrary nonlinear approximation.
- 3 or 4 layers are used most frequently.
- More layers allow more effective representation of hierarchical information in the original problem.
Radial Basis Function (RBF) Network Structure
[Figure: inputs x_1 ... x_n feed N radial basis neurons z_1 ... z_N, parameterized by centers c_ij and widths λ_ij; outputs y_1 ... y_m are formed with weights w_ij]
RBF Function (Multiquadratic)
[Plot: σ(γ) versus γ for a = 0.4 and a = 0.9]
$\sigma(\gamma) = \dfrac{1}{\left(c^2 + \gamma^2\right)^{\alpha}}, \quad \alpha > 0$
RBF Function (Gaussian)
[Plot: σ(γ) versus γ for λ = 1 and λ = 0.2]
$\sigma(\gamma) = \exp\!\left(-\left(\gamma / \lambda\right)^2\right)$
RBF Feedforward:
$y_k = \sum_{i=0}^{N} w_{ki} z_i, \quad k = 1, 2, \ldots, m$
$z_i = \sigma(\gamma_i) = e^{-\gamma_i^2}, \quad i = 1, 2, \ldots, N, \quad \gamma_0 = 0$
$\gamma_i = \sqrt{\sum_{j=1}^{n} \left(\frac{x_j - c_{ij}}{\lambda_{ij}}\right)^2}$
where $c_{ij}$ is the center of the radial basis function and $\lambda_{ij}$ is the width factor. The parameters $c_{ij}$ ($i = 1, 2, \ldots, N$; $j = 1, \ldots, n$) represent the centers of the RBFs, and the parameters $\lambda_{ij}$ represent the standard deviations of the RBFs.
Universal Approximation Theorem (RBF) (Krzyzak, Linder & Lugosi, 1996): An RBF network exists such that it will approximate an arbitrary continuous $y = f(x)$ to any accuracy required.
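A sketch of the RBF feedforward in NumPy (centers, widths, and weights are illustrative; since $\gamma_0 = 0$ gives a constant $z_0 = 1$, it is handled by prepending 1):

```python
import numpy as np

def rbf_forward(x, centers, widths, W):
    """RBF feedforward: Gaussian basis values, then a weighted sum.
    centers, widths: (N, n) arrays of c_ij and lambda_ij; W: (m, N+1)."""
    gamma = np.sqrt((((x - centers) / widths) ** 2).sum(axis=1))
    z = np.exp(-gamma ** 2)                  # z_i = e^(-gamma_i^2)
    return W @ np.concatenate(([1.0], z))    # y_k = sum_i w_ki z_i, with z_0 = 1

rng = np.random.default_rng(0)
N, n, m = 5, 3, 2
centers = rng.uniform(-1, 1, size=(N, n))
widths = np.full((N, n), 0.5)
W = rng.normal(size=(m, N + 1))
print(rbf_forward(np.array([0.2, -0.4, 0.7]), centers, widths, W))
```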
Wavelet Neural Network Structure
[Figure: inputs x_1 ... x_n feed N wavelet neurons z_1 ... z_N with translations t_ij and dilations a_1 ... a_N; outputs y_1 ... y_m are formed with weights w_ij]
For hidden neuron $z_j$:
$z_j = \sigma(\gamma_j) = \psi\!\left(\frac{x - t_j}{a_j}\right)$
where $t_j = [t_{j1}, t_{j2}, \ldots, t_{jn}]^T$ is the translation vector, $a_j$ is the dilation factor, and $\psi(\cdot)$ is a wavelet function:
$\psi(\gamma_j) = \left(\gamma_j^2 - n\right) e^{-\gamma_j^2 / 2}, \qquad \gamma_j = \left\lVert \frac{x - t_j}{a_j} \right\rVert = \sqrt{\sum_{i=1}^{n} \left(\frac{x_i - t_{ji}}{a_j}\right)^2}$
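A sketch of one wavelet hidden neuron under the radial form above (the translation and dilation values are illustrative):

```python
import numpy as np

def wavelet_neuron(x, t, a):
    """z_j = psi((x - t_j)/a_j) with psi(g) = (g^2 - n) * exp(-g^2 / 2)."""
    gamma = np.linalg.norm((x - t) / a)   # gamma_j = ||(x - t_j) / a_j||
    return (gamma ** 2 - x.size) * np.exp(-gamma ** 2 / 2)

x = np.array([0.3, -0.1])
print(wavelet_neuron(x, t=np.zeros(2), a=1.0))
```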
Wavelet Function
[Plot: wavelet neuron output z versus x for dilation factors a_1 = 1 and a_1 = 5]
Wavelet Transform
$R$: space of a real variable; $R^n$: $n$-dimensional space, i.e., the space of vectors of $n$ real variables.
A function $f : R^n \to R$ is radial if a function $g : R \to R$ exists such that $f(x) = g(\lVert x \rVert)$ for all $x \in R^n$.
If $\psi(x)$ is radial, its Fourier transform $\hat{\psi}(w)$ is also radial. Let $\hat{\psi}(w) = \eta(\lVert w \rVert)$. Then $\psi(x)$ is a wavelet function if
$C_\psi = (2\pi)^n \int_0^\infty \frac{\left|\eta(\xi)\right|^2}{\xi} \, d\xi < \infty$
The wavelet transform of $f(x)$ is
$w(a, t) = a^{-n/2} \int_{R^n} f(x)\, \psi\!\left(\frac{x - t}{a}\right) dx$
and the inverse transform is
$f(x) = \frac{1}{C_\psi} \int_0^\infty \int_{R^n} a^{-(n+1)}\, w(a, t)\, a^{-n/2}\, \psi\!\left(\frac{x - t}{a}\right) dt\, da$
Feedforward Neural Networks
- The neural network accepts the input information at the input neurons and proceeds to produce the response at the output neurons.
- There is no feedback from neurons at layer l back to neurons at layer k, k ≤ l.
- Examples of feedforward neural networks: multilayer perceptrons (MLP), radial basis function (RBF) networks, wavelet networks.
Recurrent Neural Network (RNN): Discrete Time Domain
The neural network output is a function of its present input and a history of its inputs and outputs.
[Figure: a feedforward neural network produces y(t) from the inputs x(t), x(t-τ), x(t-2τ) and the fed-back outputs y(t-τ), y(t-2τ), y(t-3τ), where τ is a unit delay]
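A sketch of this recurrent computation, assuming three delayed inputs and three delayed outputs as in the figure, with a hypothetical linear map standing in for the feedforward network:

```python
import numpy as np

def rnn_step(ff, x_hist, y_hist):
    # y(t) is a feedforward function of recent inputs and past outputs.
    return ff(np.concatenate([x_hist, y_hist]))

rng = np.random.default_rng(0)
w = 0.3 * rng.normal(size=6)             # hypothetical "feedforward network"
ff = lambda v: float(w @ v)

x = np.sin(0.3 * np.arange(50))          # input sequence x(t)
y = [0.0, 0.0, 0.0]                      # initial output history
for t in range(3, len(x)):
    # y(t) = f(x(t), x(t-tau), x(t-2*tau), y(t-tau), y(t-2*tau), y(t-3*tau))
    y.append(rnn_step(ff, x[t - 2:t + 1], np.array(y[-3:])))
print(np.round(y[-5:], 4))
```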
Dynamic Neural Networks (DNN): Continuous Time Domain
The neural network directly represents the dynamic input-output relationship of the problem. The input-output signals and their time derivatives are related through a feedforward network.
[Figure: a feedforward neural network relating x(t), x′(t), x″(t) and y(t), y′(t), y″(t)]
Self-Organizing Maps (SOM) (Kohonen, 1984)
Clustering problem: given training data $x_p$, $p = 1, 2, \ldots, P$, find cluster centers $c_k$, $k = 1, 2, \ldots, N$.
Basic clustering algorithm (a code sketch follows this list):
- For each cluster $c_k$, initialize an index set $R_k = \{\}$ (empty).
- For each $x_p$, $p = 1, 2, \ldots, P$, find the cluster center $c_i$ closest to $x_p$, i.e., find $c_i$ such that $\lVert c_i - x_p \rVert < \lVert c_j - x_p \rVert$ for all $j \neq i$, and let $R_i = R_i \cup \{p\}$.
- Adjust each center: $c_i = \frac{1}{\text{size}(R_i)} \sum_{p \in R_i} x_p$
- The process continues until the centers $c_i$ do not move any more.
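A sketch of this basic clustering algorithm in NumPy (essentially batch k-means; the data and the number of clusters here are illustrative):

```python
import numpy as np

def basic_clustering(X, N, iters=100, seed=0):
    """Assign each sample to its nearest center, recenter, repeat until stable."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=N, replace=False)]  # initial centers
    for _ in range(iters):
        # For each x_p, find the closest center c_i (this builds the sets R_i).
        nearest = np.argmin(((X[:, None, :] - c[None, :, :]) ** 2).sum(-1), axis=1)
        # Adjust each center to the mean of its set R_i.
        new_c = np.array([X[nearest == k].mean(axis=0) if np.any(nearest == k)
                          else c[k] for k in range(N)])
        if np.allclose(new_c, c):   # stop when the centers no longer move
            break
        c = new_c
    return c

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, size=(50, 2)) for m in (0, 2, 4)])
print(basic_clustering(X, N=3))
```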
Self-Organizing Maps (SOM)
A SOM is a one- or two-dimensional array of neurons in which neighboring neurons in the map correspond to neighboring cluster centers in the input data space.
Principle of topographic map information (Kohonen, 1990): the spatial location of an output neuron in the topographic map corresponds to a particular domain or feature of the input data.
[Figure: a two-dimensional array of neurons connected to the input]
Training of SOM
For each training sample $x_p$, $p = 1, 2, \ldots, P$, find the nearest cluster center $c_{ij}$ such that
$\lVert c_{ij} - x_p \rVert < \lVert c_{kl} - x_p \rVert, \quad \forall\, k, l$
Then update $c_{ij}$ and its neighboring centers:
$c_{kl} = c_{kl} + \alpha(t)\left(x_p - c_{kl}\right), \quad |k - i| < N_c, \; |l - j| < N_c$
where $N_c$ is the size of the neighborhood and $\alpha(t)$ is a positive value that decays as training proceeds.
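A sketch of SOM training following the update rule above (the grid size, the linearly decaying $\alpha(t)$, and the neighborhood size $N_c$ are assumed values):

```python
import numpy as np

def train_som(X, rows, cols, epochs=20, alpha0=0.5, Nc=2, seed=0):
    """Pull the winning center and its grid neighbors toward each sample."""
    rng = np.random.default_rng(seed)
    C = rng.uniform(X.min(), X.max(), size=(rows, cols, X.shape[1]))
    for epoch in range(epochs):
        alpha = alpha0 * (1 - epoch / epochs)   # alpha(t) decays over training
        for x in X:
            d = ((C - x) ** 2).sum(axis=2)
            i, j = np.unravel_index(np.argmin(d), d.shape)   # nearest center c_ij
            # Update all c_kl with |k - i| < Nc and |l - j| < Nc.
            for k in range(max(0, i - Nc + 1), min(rows, i + Nc)):
                for l in range(max(0, j - Nc + 1), min(cols, j + Nc)):
                    C[k, l] += alpha * (x - C[k, l])
    return C

X = np.random.default_rng(1).uniform(0, 1, size=(200, 2))
print(train_som(X, rows=4, cols=4).shape)   # (4, 4, 2): a 4x4 map of centers
```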
Filter Clustering Example (Burrascano et al.)
[Figure: an SOM clusters the input x; control logic routes x either to a general MLP or to one of four group-specific MLPs, each producing the output y]
Other Advanced Structures
Knowledge-based neural networks: embedding application-specific knowledge into the network, e.g., an empirical formula or equivalent circuit combined with a pure neural network.