METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Part 8: Neural Networks
- INTRODUCTION: BIOLOGICAL VS. ARTIFICIAL
Biological Neural Networks
A Neuron: A nerve cell, part of the nervous system and the brain. (Figure: http://hubpages.com/hub/self-affirmations)
Biological Neural Networks
- There are 10 billion neurons in the human brain.
- A huge number of connections.
- All tasks such as thinking, reasoning, learning and recognition are performed by the information storage and transfer between neurons.
- Each neuron fires if a sufficient amount of electric impulse is received from other neurons.
- The information is transferred through successive firings of many neurons through the network of neurons.
Artificial Neural Networks
An artificial NN (ANN; also called a connectionist model or a neuromorphic system) is meant to be
- A simple, computational model of the biological NN.
- A simulation of the above model for solving problems in pattern recognition, optimization, etc.
- A parallel architecture of simple processing elements connected densely.
(Figure: an artificial neural net - a neuron with inputs X1, X2, weights, and outputs Y1, Y2.)
Any application that involves
- Classification
- Optimization
- Clustering
- Scheduling
- Feature Extraction
may use ANN! Most of the time integrated with other methods such as
- Expert systems
- Markov models
WHY ANN?
- Easy to implement
- Self-learning ability
- When parallel architectures are used, very fast
- Performance at least as good as other approaches; in principle they provide nonlinear discriminants, so they can solve any P.R. problem
- Many boards and software available
APPLICATION AREAS:
- Character Recognition
- Speech Recognition
- Texture Segmentation
- Biomedical Problems (Diagnosis)
- Signal and Image Processing (Compression)
- Business (Accounting, Marketing, Financial Analysis)
Background: Pioneering Work
1940's - McCulloch-Pitts (earliest NN models)
1950's - Hebb (learning rule, 1949); Rosenblatt (Perceptrons, 1958-62)
1960's - Widrow, Hoff (Adaline, 1960)
1970's -
1980's - Hopfield, Tank (Hopfield Model, 1985); Rumelhart, McClelland (Backpropagation); Kohonen (Kohonen's nets, 1986)
1990's - Grossberg, Carpenter (ART); higher-order NNs, time-delay NNs, recurrent NNs, radial basis function NNs; applications in Finance, Engineering, etc.; a well-accepted method of classification and optimization.
2000's - Becoming a bit outdated.
ANN Models: Can be examined in
1 - Single Neuron Model
2 - Topology
3 - Learning
1 - Single Neuron Model:
(Figure: a neuron with inputs X1, ..., XN, weights w1, ..., wN, bias w0 and output Y; plots of the linear, step (bipolar) and sigmoid activation functions f(α).)
General Model: Y = f(Σ_{i=1..N} w_i x_i + w_0) = f(α) = f(net)
f(α) - Activation function:
- Binary threshold / Bipolar / Hardlimiter
- Sigmoid: f(α) = 1/(1 + exp(−αd)), with derivative df/dα = d·f(1 − f). As d → ∞, the sigmoid approaches the hard limiter.
McCulloch-Pitts Neuron:
- Binary activation.
- All weights of positive (excitatory, E) activations are the same, and all weights of negative (inhibitory, I) activations are the same.
- Fires only if E ≥ T and I = 0, where E = sum of excitatory inputs, I = sum of inhibitory inputs, T = threshold.
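The McCulloch-Pitts firing rule above can be sketched as follows; the AND-gate threshold T = 2 is an illustrative choice, not from the notes:

```python
def mcculloch_pitts(excitatory, inhibitory, threshold):
    """McCulloch-Pitts neuron: fires (1) only if the sum of excitatory
    inputs reaches the threshold and no inhibitory input is active."""
    if any(inhibitory):          # any active inhibitory input (I > 0) vetoes firing
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# A 2-input AND gate: both excitatory inputs needed (T = 2)
print(mcculloch_pitts([1, 1], [], 2))   # 1
print(mcculloch_pitts([1, 0], [], 2))   # 0
print(mcculloch_pitts([1, 1], [1], 2))  # 0 (inhibited)
```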
Higher-Order Neurons: The input to the threshold unit is not a linear but a multiplicative function of the weights. For example, a second-order neuron has a threshold logic with
α = Σ_i w_i x_i + Σ_{i,j} w_ij x_i x_j + w_0
with binary inputs. More powerful than the traditional model.
2. NN Topologies: 2 basic types:
- Feedforward
- Recurrent (loops allowed)
Both can be single layer or many successive layers.
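The second-order neuron described above can be sketched as below; the weights (w1 = w2 = 1, w12 = −2, threshold 0.5), which let a single unit compute XOR, are an illustrative choice, not from the notes:

```python
from itertools import combinations

def second_order_neuron(x, w, w2, w0, T=0.0):
    """Second-order threshold unit: alpha adds pairwise products x_i*x_j
    (weights w2) to the usual linear term, then thresholds at T."""
    alpha = w0 + sum(wi * xi for wi, xi in zip(w, x))
    alpha += sum(w2[(i, j)] * x[i] * x[j]
                 for i, j in combinations(range(len(x)), 2))
    return 1 if alpha >= T else 0

# XOR with ONE second-order neuron: alpha = x1 + x2 - 2*x1*x2
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, second_order_neuron(x, w=[1, 1], w2={(0, 1): -2}, w0=0, T=0.5))
```

A first-order threshold unit cannot compute XOR (shown later in these notes), which is what "more powerful than the traditional model" means here.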
Y = output vector, T = target output vector, X = input vector.
(Figure: a feed-forward net and a recurrent net.)
3. Learning: Means finding the weights using the input samples so that the input-output pairs behave as desired.
- Supervised: samples are labeled (of known category); P(X, T) = input - target output pair.
- Unsupervised: samples are not labeled.
Learning in general is attained by iteratively modifying the weights. Can be done in one step or a large number of steps.
Hebb's rule: If two interconnected neurons are both "on" at the same time, the weight between them should be increased (by the product of the neuron values). Single pass over the training data:
w(new) = w(old) + x·y
Fixed-Increment Rule (Perceptron):
- More general than Hebb's rule; iterative:
w(new) = w(old) + σ·x·t (change only if an error occurs)
t = target value, assumed to be 1 (if desired) or 0 (if not desired); σ is the learning rate.
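A minimal sketch of the fixed-increment perceptron rule above; bipolar targets t ∈ {+1, −1} and the OR function (a linearly separable problem) are illustrative assumptions:

```python
def perceptron_train(samples, sigma=1.0, epochs=20):
    """Fixed-increment perceptron: on a misclassified sample, w <- w + sigma*t*x.
    Inputs carry a leading 1 so w[0] acts as the bias; targets t are +1/-1."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        errors = 0
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y != t:                      # change only if an error occurs
                w = [wi + sigma * t * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                     # all samples correct: converged
            break
    return w

# OR function, bipolar targets; each input starts with the constant 1
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = perceptron_train(data)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
       for x, _ in data])  # [-1, 1, 1, 1]
```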
Delta Rule: Used in multilayer perceptrons. Iterative:
w(new) = w(old) + σ(t − y)x
where t is the target value and y is the obtained value (t is assumed to be continuous). Assumes that the activation function is the identity, so
Δw = w(new) − w(old) = σ(t − y)x
Extended Delta Rule: Modified for a differentiable activation function:
Δw = σ(t − y)·x·f'(α)
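The delta rule with identity activation amounts to a least-mean-squares fit. A sketch; the target line y = 2x + 1 and the learning rate are illustrative assumptions:

```python
def delta_rule(samples, sigma=0.1, epochs=100):
    """Delta rule with identity activation: w <- w + sigma*(t - y)*x,
    where y = w . x and t is a continuous target value."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, t in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + sigma * (t - y) * xi for wi, xi in zip(w, x)]
    return w

# Fit y = 2*x + 1 (inputs carry a leading 1 for the bias weight)
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.0)]
w = delta_rule(data)
print([round(wi, 2) for wi in w])  # close to [1.0, 2.0]
```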
PATTERN RECOGNITION USING NEURAL NETS
A neural network (connectionist system) imitates the neurons in the human brain. In the human brain there are 10^13 neurons.
(Figure: a neural net model with 3 inputs and 2 outputs.)
Each processing element either fires or it does not fire.
w_i = weights between neurons and inputs to the neurons.
The model for each neuron:
(Figure: a neuron with inputs X1, X2, ..., Xn and output Y.)
Y = f(Σ_{i=1..n} w_i x_i − w_0) = f(Σ_{i=0..n} w_i x_i)
f = activation function, normally nonlinear.
Hard-limiter: f(α) = +1 if α ≥ 0, −1 otherwise.
Sigmoid: f(α) = 1/(1 + e^(−α))
TOPOLOGY: How neurons are connected to each other. Once the topology is determined, the weights are to be found using learning samples. The process of finding the weights is called the learning algorithm.
Negative weights - inhibitory
Positive weights - excitatory
How can a NN be used for Pattern Classification?
- Inputs are feature vectors.
- Each output represents one category.
- For a given input, one of the outputs fires (the output that gives the highest value), so the input sample is classified to that category.
Many topologies used for P.R.:
- Hopfield Net
- Hamming Net
- Multilayer Perceptron
- Kohonen's feature map
- Boltzmann Machines
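The "highest output wins" classification step above can be sketched in a few lines; the output values and category names are hypothetical:

```python
def classify(outputs, categories):
    """Assign the input to the category whose output neuron fires the
    highest value (argmax over the output layer)."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return categories[best]

# Hypothetical output-layer activations for a 3-class diagnosis problem
print(classify([0.12, 0.87, 0.31], ["normal", "benign", "malignant"]))  # benign
```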
MULTILAYER PERCEPTRON
Single layer: outputs y1, ..., ym; inputs x1, ..., xn.
Linear discriminants:
- Cannot solve problems with nonlinear decision boundaries.
(Figure: the XOR problem in the x1-x2 plane; no linear solution exists.)
Multilayer Perceptron: outputs y1, ..., ym; hidden layer 2; hidden layer 1; inputs x1, ..., xn.
(Figure: a fully connected multilayer perceptron.)
It was shown that an MLP with 2 hidden layers can form arbitrary decision boundaries.
Learning in MLP: Found in the mid 80's.
Back Propagation Learning Algorithm
1 - Start with arbitrary weights.
2 - Present the learning samples one by one to the inputs of the network. If the network outputs are not as desired (1 for the corresponding output and 0 for the others), adjust the weights starting from the top level, trying to reduce the differences.
3 - Propagate the adjustments downwards until you reach the bottom layer.
4 - Continue repeating 2 & 3 for all samples again & again until all samples are correctly classified.
Example:
(Figure: a two-layer net for XOR built from an OR gate and a NOT-AND gate feeding an AND gate. The output fires (1) only when (X1, X2) = (1, −1) or (−1, 1), and is −1 for the other combinations; hidden-unit outputs are 1 or −1. The figure shows example weights and thresholds.)
X1 XOR X2 = (X1 + X2)·(NOT(X1·X2)) = (X1 + X2)·(X1' + X2')
(Figure: the XOR net with its activation function; the second hidden unit computes NOT AND.)
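The decomposition XOR = (X1 OR X2) AND NOT(X1 AND X2) can be checked directly. A sketch with 0/1 logic; the thresholds (0.5, 1.5) are one standard choice and may differ from the figure's values:

```python
def step(alpha):                          # hard-limiter activation
    return 1 if alpha >= 0 else 0

def xor_net(x1, x2):
    """Two-layer net: XOR = (x1 OR x2) AND NOT(x1 AND x2)."""
    h_or   = step(x1 + x2 - 0.5)          # OR gate: fires if either input is on
    h_nand = step(-x1 - x2 + 1.5)         # NOT-AND gate: fires unless both are on
    return step(h_or + h_nand - 1.5)      # AND of the two hidden units

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(*x))
```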
Expressive Power of Multilayer Networks
Assume there are 3 layers as above (input, one hidden, output). For classification with c categories, d features and m hidden nodes, each output y_k is our familiar discriminant function:
g_k(X) = f( Σ_{j=1..m} w_kj · f( Σ_{i=1..d} w_ji x_i + w_j0 ) + w_k0 )
By allowing f to be continuous and varying, this is the formula for the discriminant function of a 3-layer (one input, one hidden, one output) network (m = number of nodes in the hidden layer).
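A forward pass computing g_k(X) for the 3-layer formula above might look like this; the sigmoid activation and the weight values are illustrative assumptions:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def discriminant(x, W_hidden, W_out):
    """g_k(X) = f( sum_j w_kj * f( sum_i w_ji x_i + w_j0 ) + w_k0 ).
    Each weight row ends with its bias term (w_j0 / w_k0)."""
    y = [sigmoid(sum(w * xi for w, xi in zip(row[:-1], x)) + row[-1])
         for row in W_hidden]                       # m hidden-layer outputs
    return [sigmoid(sum(w * yj for w, yj in zip(row[:-1], y)) + row[-1])
            for row in W_out]                       # one g_k per category

# d=2 features, m=2 hidden nodes, c=2 categories (arbitrary weights)
g = discriminant([0.5, -1.0],
                 W_hidden=[[1.0, -2.0, 0.1], [0.5, 1.0, -0.3]],
                 W_out=[[2.0, -1.0, 0.0], [-1.5, 0.5, 0.2]])
print(len(g), all(0 < gk < 1 for gk in g))  # 2 True
```

Classification then picks the k maximizing g_k(X).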
Any continuous function can be implemented with a 3-layer network as above, with a sufficient number of hidden units (Kolmogorov, 1957):
g(X) = Σ_{j=1..2d+1} I_j( Σ_{i=1..d} ψ_ij(x_i) )
That means any boundary can be implemented. Then the optimum Bayes rule (g(x) = a posteriori probabilities) can be implemented with such a network!
In practice:
- How many nodes?
- How do we make it learn the weights with learning samples?
- Activation function?
Back-Propagation Algorithm: A learning algorithm for multilayer feed-forward networks, found in the 80's by Rumelhart et al.
We have a labeled sample set {X1, X2, ..., Xn}.
(Figure: a NN with inputs x1, ..., xd, actual outputs z1, ..., zc and target outputs t1, ..., tc.)
Target output: t_i high and t_j low for j ≠ i when X ∈ C_i.
z1, ..., zc = actual outputs.
We want: Find W (weight vector) so that the difference between the target output and the actual output is minimized. The criterion function
J(W) = (1/2) Σ_k (t_k − z_k)^2 = (1/2) ||T − Z||^2
is minimized for the given learning set.
The Back Propagation Algorithm works basically as follows:
1 - Arbitrary initial weights are assigned to all connections.
2 - A learning sample is presented at the input, which will cause arbitrary outputs to be peaked. Sigmoid nonlinearities are used to find the output of each node.
3 - The topmost layer's weights are changed to force the outputs to the desired values.
4 - Moving down the layers, each layer's weights are updated to force the desired outputs.
5 - Iteration continues by using all the training samples many times, until a set of weights that will result in correct outputs for all learning samples is found (or J(W) < θ).
The weights are changed according to the following criterion: if j is any node and i is one of the nodes a layer below (connected to node j), update w_ji as follows:
w_ji(t+1) = w_ji(t) + η·δ_j·X_i (Generalized delta rule)
where X_i is either the output of node i or an input, and
δ_j = z_j(1 − z_j)(t_j − z_j) in case j is an output node;
z_j is the output and t_j is the desired output at node j.
If j is an intermediate (hidden) node,
δ_j = x_j(1 − x_j) Σ_k δ_k w_kj
where x_j is the output of node j.
(Figure: a hidden node j with output x_j feeding the output nodes z_1, ..., z_c.)
How do we get these updates? Apply the gradient descent algorithm to the network:
J(W) = (1/2) ||T − Z||^2
Gradient Descent: Move towards the direction of the negative of the gradient of J(W):
ΔW = −η ∂J/∂W, η = learning rate.
For each component:
w_ji(t+1) = w_ji(t) − η ∂J/∂w_ji
Now we have to evaluate ∂J/∂w_ji for output and hidden nodes. But we can write this as follows:
∂J/∂w_ji = (∂J/∂α_j)(∂α_j/∂w_ji), where α_j = Σ_i w_ji X_i
But ∂α_j/∂w_ji = X_i. Now call δ_j = −∂J/∂α_j. Then
Δw_ji = η·δ_j·X_i
Now δ_j will vary depending on j being an output node or a hidden node.
Output node (use the chain rule):
δ_j = −∂J/∂net_j = −(∂J/∂z_j)(∂z_j/∂net_j) = (t_j − z_j)·z_j(1 − z_j)
where z_j(1 − z_j) is the derivative of the activation function (sigmoid). So
w_ji(t+1) = w_ji(t) + η·z_j(1 − z_j)(t_j − z_j)·X_i
Hidden Node (use the chain rule again):
δ_j = −∂J/∂net_j = −(∂J/∂y_j)(∂y_j/∂net_j)
Now evaluate ∂J/∂y_j with J = (1/2) Σ_k (t_k − z_k)^2:
∂J/∂y_j = −Σ_k (t_k − z_k) ∂z_k/∂y_j = −Σ_k (t_k − z_k)·z_k(1 − z_k)·w_kj = −Σ_k δ_k w_kj
and ∂y_j/∂net_j = y_j(1 − y_j), so
δ_j = y_j(1 − y_j) Σ_k δ_k w_kj
w_ji(t+1) = w_ji(t) + η·δ_j·x_i
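Putting the output-node and hidden-node deltas together, back-propagation for a small 2-m-1 sigmoid net on XOR can be sketched as follows. The network size, learning rate, epoch count and random-restart loop are implementation choices, not from the notes:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_xor(m=3, eta=1.0, epochs=4000):
    """Back-propagation on a 2-m-1 sigmoid net for XOR.
    Output node:  delta   = z(1-z)(t-z)
    Hidden node:  delta_j = y_j(1-y_j) * delta * w_j
    Update:       w <- w + eta * delta * (input to that weight)."""
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for seed in range(20):                 # random restarts vs. bad local minima
        rnd = random.Random(seed)
        Wh = [[rnd.uniform(-1, 1) for _ in range(3)] for _ in range(m)]
        Wo = [rnd.uniform(-1, 1) for _ in range(m + 1)]   # last entry = bias

        def forward(x1, x2):
            y = [sigmoid(w[0] * x1 + w[1] * x2 + w[2]) for w in Wh]
            z = sigmoid(sum(Wo[j] * y[j] for j in range(m)) + Wo[m])
            return y, z

        for _ in range(epochs):
            for (x1, x2), t in data:
                y, z = forward(x1, x2)
                d = z * (1 - z) * (t - z)             # output-node delta
                for j in range(m):                    # top layer first, then down
                    dj = y[j] * (1 - y[j]) * d * Wo[j]   # hidden delta (old Wo)
                    Wo[j] += eta * d * y[j]
                    Wh[j][0] += eta * dj * x1
                    Wh[j][1] += eta * dj * x2
                    Wh[j][2] += eta * dj
                Wo[m] += eta * d
        if all(round(forward(*x)[1]) == t for x, t in data):
            return lambda x1, x2: forward(x1, x2)[1]  # converged
    return lambda x1, x2: forward(x1, x2)[1]

predict = train_xor()
print([round(predict(*x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Note that each hidden delta dj is computed from the pre-update output weight Wo[j], matching the derivation: the top layer's deltas are fixed before they are propagated downwards.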