Introduction to Artificial Neural Networks - theory, application and practice using WEKA
Anto Satriyo Nugroho, Dr.Eng
Center for Information & Communication Technology, Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: asnugroho@gmail.com URL: http://asnugroho.net
Agenda
1. Brain, Biological neuron, Artificial Neuron
2. Perceptron
3. Multilayer Perceptron & Backpropagation Algorithm
4. Application of neural networks
5. Practice using WEKA
6. Important & useful references
Brain vs Computer
- Information processing: Brain: low speed, fuzzy, parallel / Computer: fast, accurate, sequential
- Specialization: Brain: pattern recognition / Computer: numerical computation
- Information representation: Brain: analog / Computer: digital
- Number of elements: Brain: ~10 billion / Computer: ~10^6
- Speed: Brain: slow (10^3/s) / Computer: fast (10^9/s)
- Performance improvement: Brain: learning / Computer: software upgrade
- Memory: Brain: associative (distributed among the synapses) / Computer: address-based
Biological Neuron
The human brain contains about 1.4 x 10^11 neurons.
Structure: Cell Body, Dendrite, Axon, Synapse (each neuron has 10^3~10^4 synapses)
Biological Neural Network
1. Principle of the neuron: collection, processing, and dissemination of electrical signals
2. The information processing capacity of the brain emerges from the network of neurons
Mathematical Model of Neuron - McCulloch & Pitts (1943)
Input signals x_1, x_2, x_3, ..., x_n arrive through n synapses with weights w_1, ..., w_n; f is the activation function. The output signal is
y = f(Σ_{i=1}^{n} w_i x_i)
The input signals can be considered as the dendrites of a biological neuron; the output signal can be considered as the axon.
Components of a neuron
- Synapses (weights)
- Calculator of the weighted input signals
- Activation function: y = f(Σ_{i=1}^{n} w_i x_i)
Activation Function
1. Threshold function (Heaviside function):
f(v) = 1 if v > 0, 0 if v <= 0
Used by McCulloch & Pitts; all-or-none characteristic.
Activation Function
2. Piecewise-linear function:
f(v) = 1 if v >= 1/2; v + 1/2 if 1/2 > v > -1/2; 0 if v <= -1/2
Activation Function
3. Sigmoid function:
f(x) = 1 / (1 + e^(-c·x))
plotted for several values of the slope parameter c (c = 1, 2, 4).
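The three activation functions above translate directly into code. A minimal sketch in Python (the function names are my own):

```python
import math

def heaviside(v):
    # Threshold (Heaviside) function: all-or-none output
    return 1 if v > 0 else 0

def piecewise_linear(v):
    # Saturates at 0 and 1, linear in between
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5

def sigmoid(x, c=1.0):
    # Logistic sigmoid with slope parameter c
    return 1.0 / (1.0 + math.exp(-c * x))
```

Note that as c grows, the sigmoid approaches the threshold function, which is why it is often used as a smooth, differentiable stand-in for it.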
How to calculate a neuron's output (without bias)?
Input: x = (0, 1), weights w = (0.5, -0.5)
v = 0 x 0.5 + 1 x (-0.5) = -0.5
Heaviside activation function: f(v) = 1 if v > 0, 0 if v <= 0, so f(v) = 0
How to calculate a neuron's output (with bias)?
Input: x = (0, 1), weights w = (0.5, -0.5), bias θ = -0.7
v = (0 x 0.5 + 1 x (-0.5)) - (-0.7) = 0.2
Heaviside activation function: f(v) = 1 if v > 0, 0 if v <= 0, so f(v) = 1
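The two worked examples above can be reproduced in a few lines (a sketch; the helper name `neuron_output` is my own):

```python
def neuron_output(x, w, theta=0.0):
    # Weighted sum of inputs minus the bias theta, then Heaviside activation
    v = sum(xi * wi for xi, wi in zip(x, w)) - theta
    return 1 if v > 0 else 0

# Without bias: v = 0*0.5 + 1*(-0.5) = -0.5, so the output is 0
print(neuron_output([0, 1], [0.5, -0.5]))              # 0
# With bias theta = -0.7: v = -0.5 - (-0.7) = 0.2, so the output is 1
print(neuron_output([0, 1], [0.5, -0.5], theta=-0.7))  # 1
```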
Artificial Neural Network
1. Architecture: how the neurons are connected to each other
   1. Feed-forward networks
   2. Recurrent networks
2. Learning algorithm: how the network is trained to fit an input-output mapping/function
   LMS, Delta rule, Backpropagation, etc.
Agenda
1. Brain, Biological neuron, Artificial Neuron
2. Perceptron
3. Multilayer Perceptron & Backpropagation Algorithm
4. Application of Neural Networks
5. Practice using WEKA
6. Important & useful references
Christopher M. Bishop: Pattern Recognition & Machine Learning, Springer, 2006, p.196
Perceptron Learning (taking the AND function as example)
x1 x2 | y
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
Perceptron Learning (taking the AND function as example)
Training set: 4 examples, each consisting of a 2-dimensional input vector and a teaching signal (desired output):
(x = (0,0), y = 0), ((0,1), 0), ((1,0), 0), ((1,1), 1)
Input: x, Output: h_W(x)
Learn by adjusting the weights to reduce the error on the training set. The squared error for an example with input x and true output (teaching signal) y is
E = 1/2 Err^2 = 1/2 (y - h_W(x))^2
Gradient Descent Optimization
w(t+1) = w(t) - α ∇E(w(t))
Weight Update Rule
Perform optimization search by gradient descent:
∂E/∂W_j = Err x ∂Err/∂W_j = Err x ∂/∂W_j (y - g(Σ_{j=0}^{n} W_j x_j)) = -Err x g'(in) x x_j
where in = Σ_{j=0}^{n} W_j x_j.
Simple weight update rule:
W_j ← W_j + α x Err x g'(in) x x_j
What if we use the sigmoid function as g? Like this:
g(x) = 1 / (1 + e^(-x))
d/dx g(x) = e^(-x) / (1 + e^(-x))^2 = (1 / (1 + e^(-x))) x (1 - 1 / (1 + e^(-x))) = g(x)(1 - g(x))
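The identity g'(x) = g(x)(1 - g(x)) is easy to check numerically, comparing the analytic form against a finite difference (a quick sketch):

```python
import math

def g(x):
    # Logistic sigmoid
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    # Analytic derivative via the identity g'(x) = g(x)(1 - g(x))
    return g(x) * (1.0 - g(x))

# Compare against a central finite difference at a few points
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (g(x + h) - g(x - h)) / (2 * h)
    print(x, g_prime(x), numeric)  # the two columns agree closely
```

This convenient derivative is the reason the sigmoid became the standard choice for gradient-based training.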
Weight Update Rule (using sigmoid as activation function)
Perform optimization search by gradient descent:
∂E/∂W_j = Err x ∂Err/∂W_j = Err x ∂/∂W_j (y - g(Σ_{j=0}^{n} W_j x_j)) = -Err x g'(in) x x_j
where in = Σ_{j=0}^{n} W_j x_j.
Simple weight update rule:
W_j ← W_j + α x Err x g(in)(1 - g(in)) x x_j
Perceptron Learning Algorithm AIMA, p.742
Perceptron Learning Algorithm
for (e = 1; e <= N; e++):
  Output calculation: in = Σ_{j=0}^{n} W_j x_j[e]; output = g(in)
  Error calculation: Err = y[e] - g(in)
  Weight update: W_j ← W_j + α x Err x g'(in) x x_j[e]
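The loop above, combined with the sigmoid update rule from the previous slide, can be sketched in Python and trained on the AND data (the learning rate of 0.5 and the epoch count are my own illustrative choices):

```python
import math, random

def g(v):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-v))

# AND training set; x[0] = 1 is a constant input playing the role of the bias
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]

random.seed(0)
W = [random.uniform(-0.5, 0.5) for _ in range(3)]
alpha = 0.5  # learning rate

for epoch in range(5000):
    for x, y in data:
        net = sum(wj * xj for wj, xj in zip(W, x))
        err = y - g(net)
        # W_j <- W_j + alpha * Err * g(in)(1 - g(in)) * x_j
        W = [wj + alpha * err * g(net) * (1 - g(net)) * xj
             for wj, xj in zip(W, x)]

# Threshold the sigmoid outputs at 0.5; AND is linearly separable,
# so this should converge to [0, 0, 0, 1]
outputs = [round(g(sum(wj * xj for wj, xj in zip(W, x)))) for x, _ in data]
print(outputs)
```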
AND function using a Perceptron
Weights: w1 = 1.0, w2 = 1.0, threshold θ = 1.5
Heaviside activation function: f(v) = 1 if v > 0, 0 if v <= 0
x1 x2 | y
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
OR function using a Perceptron
Weights: w1 = 1.0, w2 = 1.0, threshold θ = 0.5
Heaviside activation function: f(v) = 1 if v > 0, 0 if v <= 0
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1
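The two weight settings above can be checked directly (a quick sketch):

```python
def perceptron(x1, x2, w1, w2, theta):
    # Heaviside unit: fires when the weighted sum exceeds the threshold theta
    v = w1 * x1 + w2 * x2 - theta
    return 1 if v > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# AND with theta = 1.5: only (1,1) gives v = 0.5 > 0
print([perceptron(x1, x2, 1.0, 1.0, 1.5) for x1, x2 in inputs])  # [0, 0, 0, 1]
# OR with theta = 0.5: any active input gives v >= 0.5 > 0
print([perceptron(x1, x2, 1.0, 1.0, 0.5) for x1, x2 in inputs])  # [0, 1, 1, 1]
```

Only the threshold differs between the two functions; the same weights realize both.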
A problem appears when the perceptron is used to learn a NON-linear function.
[Figure: result of training on XOR - plots of MSE and output vs. iteration]
[Figure: the Class 0 and Class 1 points of XOR plotted in the input plane - no single line (e.g. x1 = -x0 + 0.5 or x1 = -x0 - 0.5) separates the two classes]
XOR
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0
[Figure: a 2-2-1 network realizing XOR, with weights -5, -5, 5, 5 between input and hidden layer (hidden biases -2.5, 2.5) and weights -5, 5 to the output unit (output bias -2.5)]
A non-linear mapping can be realized by inserting a hidden layer, but the learning algorithm was not known until 1986.
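To illustrate how a hidden layer makes XOR realizable, here is one possible hand-set weight assignment (not the exact weights from the figure): the two hidden units compute OR and NAND of the inputs, and the output unit ANDs them together.

```python
def step(v):
    # Heaviside threshold unit
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR of the inputs
    h2 = step(-x1 - x2 + 1.5)   # hidden unit 2: NAND of the inputs
    return step(h1 + h2 - 1.5)  # output unit: AND of the hidden units -> XOR

print([xor_net(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Each unit alone is linear, but composing them through the hidden layer yields the non-linear mapping.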
Marvin Minsky (cognitive scientist), Seymour Papert (MIT mathematician)
Agenda
1. Brain, Biological neuron, Artificial Neuron
2. Perceptron
3. Multilayer Perceptron & Backpropagation Algorithm
4. Application of neural networks
5. Practice using WEKA
6. Important & useful references
1986, Chap.8, pp.318-362, Learning Internal Representations by Error Propagation
David E. Rumelhart: A Scientific Biography - http://www.cnbc.cmu.edu/derprize/
Backpropagation Learning
1. Input a datum from the training set to the input layer, and calculate the output of each neuron in the hidden and output layers (forward pass).
[Figure: input data X flowing from the input layer through the hidden layer to the output layer]
Backpropagation Learning
2. Calculate the error, that is, the difference (Δ) between the output of each neuron in the output layer and the desired value (teaching signal).
[Figure: teaching signal compared with the output layer, yielding a Δ per output neuron]
Backpropagation Learning
2. (Example) Input data: an image of 'B'. The output layer has one neuron per class (A, B, C); their output values are 0.5, 0.3 and 0.1.
Backpropagation Learning
2. (Example, continued) The teaching signal for an image of 'B' is A: 0, B: 1, C: 0, while the outputs are A: 0.5, B: 0.3, C: 0.1.
Backpropagation Learning
2. (Example, continued) The errors are Δ_A = 0 - 0.5, Δ_B = 1 - 0.3, Δ_C = 0 - 0.1.
Backpropagation Learning
3. Using the Δ values, update the weights between the output and hidden layers, and between the hidden and input layers (backward pass).
Backpropagation Learning
4. Repeat steps 1 to 3 until a stopping criterion is satisfied.
Stopping criteria: maximum number of epochs/iterations reached, or MSE (mean squared error) below a threshold.
BP for a 3-layer MLP
Notation: x = input vector; I_i = output of input neuron i; H_j = output of hidden neuron j; O_k = output of output neuron k; w_ij = weight between input neuron i and hidden neuron j; w_jk = weight between hidden neuron j and output neuron k.
Forward Pass (1): input layer → hidden layer
I_i = x_i
net_j = Σ_i w_ij I_i + θ_j   (θ_j is the bias)
H_j = f(net_j) = 1 / (1 + e^(-net_j))
Forward Pass (2): hidden layer → output layer
net_k = Σ_j w_jk H_j + θ_k
O_k = f(net_k) = 1 / (1 + e^(-net_k))
Backward Pass 1: hidden-output layer
w_jk(new) = w_jk(old) + Δw_jk
Error (MSE, mean squared error): E = 1/2 Σ_k (t_k - O_k)^2, where t_k is the teaching signal
δ_k = (t_k - O_k) O_k (1 - O_k)
Weight update: Δw_jk = -η ∂E/∂w_jk = η δ_k H_j, where η is the learning rate
The error is given by E = 1/2 Σ_k (t_k - O_k)^2. The modification of the weights between the output and hidden layers due to the error E is calculated as follows:
∂E/∂w_jk = ∂E/∂O_k x ∂O_k/∂net_k x ∂net_k/∂w_jk
         = -(t_k - O_k) x e^(-net_k)/(1 + e^(-net_k))^2 x H_j
         = -(t_k - O_k) x O_k(1 - O_k) x H_j
∂E/∂w_jk = -(t_k - O_k) O_k (1 - O_k) H_j = -δ_k H_j, where δ_k = (t_k - O_k) O_k (1 - O_k).
Thus, the weight correction is obtained as follows:
Δw_jk = -η ∂E/∂w_jk = η δ_k H_j
η is the learning rate.
Backward Pass 2: input-hidden layer
w_ij(new) = w_ij(old) + Δw_ij
Weight update: Δw_ij = -η ∂E/∂w_ij = η δ_j x_i, where δ_j = H_j(1 - H_j) Σ_k δ_k w_jk
The weight correction between the hidden and input layers is determined in a similar way:
∂E/∂w_ij = Σ_k (∂E/∂O_k x ∂O_k/∂net_k x ∂net_k/∂H_j) x ∂H_j/∂net_j x ∂net_j/∂w_ij
with ∂net_k/∂H_j = w_jk, ∂H_j/∂net_j = e^(-net_j)/(1 + e^(-net_j))^2 = H_j(1 - H_j), and ∂net_j/∂w_ij = x_i, giving
∂E/∂w_ij = -(Σ_k δ_k w_jk) x H_j(1 - H_j) x x_i
Hence ∂E/∂w_ij = -δ_j I_i, where δ_j = H_j(1 - H_j) Σ_k δ_k w_jk (and I_i = x_i).
The correction of the weight is
Δw_ij = -η ∂E/∂w_ij = η δ_j x_i
Momentum
Output-hidden layer: Δw_jk(t) = η δ_k H_j becomes Δw_jk(t) = η δ_k H_j + α Δw_jk(t-1)
Hidden-input layer: Δw_ij(t) = η δ_j x_i becomes Δw_ij(t) = η δ_j x_i + α Δw_ij(t-1)
Momentum adds inertia to the motion through weight space, preventing oscillation.
Training Process: Forward Pass
1. Calculate the output of the input layer: I_i = x_i
2. Calculate the output of the hidden layer: net_j = Σ_i w_ij I_i + θ_j, H_j = f(net_j) = 1 / (1 + e^(-net_j))
3. Calculate the output of the output layer: net_k = Σ_j w_jk H_j + θ_k, O_k = f(net_k) = 1 / (1 + e^(-net_k))
Training Process: Backward Pass
1. Calculate the δ of the output layer: δ_k = O_k(1 - O_k)(t_k - O_k)
2. Update the weights between hidden & output layer: w_jk(new) = w_jk(old) + Δw_jk, Δw_jk = η δ_k H_j
3. Calculate the δ of the hidden layer: δ_j = H_j(1 - H_j) Σ_k δ_k w_jk
4. Update the weights between input & hidden layer: w_ij(new) = w_ij(old) + Δw_ij, Δw_ij = η δ_j I_i
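Putting the forward and backward passes together, here is a compact sketch of the full training loop for a 2-4-1 network with sigmoid units and momentum, trained on XOR. The network size, learning rate η = 0.5 and momentum α = 0.9 are my own illustrative choices, not values from the slides:

```python
import math, random

def f(net):
    # Sigmoid activation used in both layers
    return 1.0 / (1.0 + math.exp(-net))

random.seed(0)
n_in, n_hid, n_out = 2, 4, 1
eta, alpha = 0.5, 0.9  # learning rate, momentum

# XOR training set: (input vector x, teaching signal t)
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

w_ij = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]
w_jk = [[random.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hid)]
th_j = [random.uniform(-1, 1) for _ in range(n_hid)]
th_k = [random.uniform(-1, 1) for _ in range(n_out)]
d_ij = [[0.0] * n_hid for _ in range(n_in)]  # previous updates, for momentum
d_jk = [[0.0] * n_out for _ in range(n_hid)]

def forward(x):
    H = [f(sum(w_ij[i][j] * x[i] for i in range(n_in)) + th_j[j]) for j in range(n_hid)]
    O = [f(sum(w_jk[j][k] * H[j] for j in range(n_hid)) + th_k[k]) for k in range(n_out)]
    return H, O

def mse():
    return sum((t[k] - forward(x)[1][k]) ** 2
               for x, t in data for k in range(n_out)) / len(data)

err_before = mse()
for epoch in range(5000):
    for x, t in data:
        H, O = forward(x)
        # delta of the output layer: O(1-O)(t-O)
        dk = [O[k] * (1 - O[k]) * (t[k] - O[k]) for k in range(n_out)]
        # delta of the hidden layer: H(1-H) * sum_k delta_k w_jk
        dj = [H[j] * (1 - H[j]) * sum(dk[k] * w_jk[j][k] for k in range(n_out))
              for j in range(n_hid)]
        for j in range(n_hid):      # hidden -> output weights, with momentum
            for k in range(n_out):
                d_jk[j][k] = eta * dk[k] * H[j] + alpha * d_jk[j][k]
                w_jk[j][k] += d_jk[j][k]
        for k in range(n_out):
            th_k[k] += eta * dk[k]
        for i in range(n_in):       # input -> hidden weights, with momentum
            for j in range(n_hid):
                d_ij[i][j] = eta * dj[j] * x[i] + alpha * d_ij[i][j]
                w_ij[i][j] += d_ij[i][j]
        for j in range(n_hid):
            th_j[j] += eta * dj[j]

print(err_before, mse())  # the MSE drops over training
```

The hidden layer uses more than the minimal two units here, which makes the non-convex XOR training less prone to getting stuck in a poor local minimum.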
1. Implementation of a Neural Network for a Handwriting Numeral Recognition System in a Facsimile Auto-dialing System
Hand-written auto-dialing facsimile (SFX-70CL):
1. Write the dial number (e.g. 123-456-7890) at the head of the facsimile draft
2. Insert the draft
3. The dial number is recognized and displayed
4. Auto-dialing
5. Sending the draft
Related Publication: Hand-written Numeric Character Recognition for Facsimile Auto-dialing by Large Scale Neural Network CombNET-II, Proc. of 4th International Conference on Engineering Applications of Neural Networks, pp.40-46, June 10-12, 1998, Gibraltar
2. Automatic System for Locating Characters Using a Stroke Analysis Neural Network
Applications: robot eyes; support system for the visually handicapped
Pipeline: camera → input image → find the text region → character recognition → text-to-speech synthesizer
Related Publication: An algorithm for locating characters in color image using stroke analysis neural network, Proc. of the 9th International Conference on Neural Information Processing (ICONIP'02), Vol.4, pp.2132-2136, November 18-22, 2002, Singapore
3. Fog Forecasting by the large-scale neural network CombNET-II
- Predicting fog events based on meteorological observations
- The prediction was made every 30 minutes, and the result was used to support aircraft navigation
- The number of fog events was very small compared to no-fog events, so this can be considered a pattern classification problem involving imbalanced training sets
- Observations were made every 30 minutes at Long. 141.70 E, Lat. 42.77, 25 m above sea level, by the Shin Chitose Meteorological Observatory Station (Hokkaido Island, Japan)
- A fog event is defined as a condition where the range of visibility is < 1000 m and the weather shows the appearance of fog
- Winner of the competition (1999)
Observed Information (26 meteorological variables)
1 Year, 2 Month, 3 Date, 4 Time, 5 Atmospheric pressure [hPa], 6 Temperature [°C], 7 Dew point temperature [°C], 8 Wind direction [°], 9 Wind speed [m/s], 10 Max. instantaneous wind speed [m/s], 11 Change of wind (1) [°], 12 Change of wind (2) [°], 13 Range of visibility, 14 Weather, 15 Cloudiness (1st layer), 16 Cloud shape (1st layer), 17 Cloud height (1st layer), 18 Cloudiness (2nd layer), 19 Cloud shape (2nd layer), 20 Cloud height (2nd layer), 21 Cloudiness (3rd layer), 22 Cloud shape (3rd layer), 23 Cloud height (3rd layer), 24 Cloudiness (4th layer), 25 Cloud shape (4th layer), 26 Cloud height (4th layer)
Example: 1984 1 1 4.5 1008 0.0 7.0 270 6 1 1 1 9999 85 0 2 10 0 4 25 1 1 1 1 1 1
Result of the 1999 Fog Forecasting Contest
Problem: given the complete observation data of 1984-1988 and 1990-1994 for designing the model, predict the appearance of fog events during 1989 and 1995 (539 actual fog events).
Method | Predictions | Correctly predicted | False predictions
CombNET-II (proposed method) | 622 | 374 | 370
Probabilistic NN | 169 | 127 | 445
Modified Counter Propagation NN | 908 | 178 | 734
Achievements
This study won the first prize award in the 1999 Fog Forecasting Contest sponsored by the Neurocomputing Technical Group of IEICE-Japan.
Related Publications:
1. A Solution for Imbalanced Training Sets Problem by CombNET-II and Its Application on Fog Forecasting, IEICE Trans. on Information & Systems, Vol.E85-D, No.7, pp.1165-1174, July 2002
2. Mathematical perspective of CombNET and its application to meteorological prediction, Special Issue of the Meteorological Society of Japan on Mathematical Perspective of Neural Network and its Application to Meteorological Problem, Meteorological Research Note, No.203, pp.77-107, October 2002
4. NETtalk
http://www.cnl.salk.edu/parallelnetspronounce/index.php
T.J. Sejnowski and C.R. Rosenberg: a parallel network that learns to read aloud, Cognitive Science, 14:179-211, 1990.
Simulation: Continuous Informal Speech, pp.194-203
Network architecture: 203-120-26 (trained in 30,000 iterations)
Input: text (1000 words: THE, OF, AND, TO, IN, etc.); Output: phonemes (accuracy 98%)
5. Handwriting Digit Recognition
http://yann.lecun.com/exdb/mnist/
The MNIST database consists of 60,000 examples as the training set and 10,000 examples as the testing set.
- Linear classifier: 8.4% error
- K-nearest neighbor classifier, L3: 1.22% error
- SVM, Gaussian kernel: 1.4% error
- SVM, degree-4 polynomial: 1.1% error
- 2-layer ANN with 800 hidden units: 0.9% error
Currently (26 October 2009) the best accuracy is achieved using a Large Convolutional Network (0.39% error).
Agenda
1. Brain, Biological neuron, Artificial Neuron
2. Perceptron
3. Multilayer Perceptron & Backpropagation Algorithm
4. Application of neural networks
5. Practice using WEKA
6. Important & useful references
Flow of an AI experiment
Training set → model fitting
Validation set → error estimation of the selected model
Testing set → generalization assessment of the final chosen model
The AI model is then applied to the real world.
How to make an experiment using an ANN?
Step 1: Prepare three data sets which are independent of each other: training set, validation set and testing set.
Step 2: Train the neural network using an initial parameter setting:
- stopping criteria (training is stopped when t iterations are exceeded OR the MSE is lower than z)
- number of hidden neurons
- learning rate
- momentum
How to make an experiment using an ANN?
Step 3: Evaluate the performance of the initial model by measuring its accuracy on the validation set.
Step 4: Change the parameters and repeat steps 2 and 3 until a satisfactory result is achieved.
Step 5: Evaluate the performance of the neural network by measuring its accuracy on the testing set.
Performance Evaluation
Training set: model fitting
Validation set: estimation of the prediction error for model selection
Testing set: assessment of the generalization error of the final chosen model
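The three-way split above can be sketched in a few lines (the 60/20/20 proportions and the function name are my own illustrative choices):

```python
import random

def three_way_split(dataset, seed=0, frac_train=0.6, frac_val=0.2):
    # Shuffle once, then cut into three independent sets:
    # training -> model fitting, validation -> model selection,
    # testing -> final generalization assessment (used only once, at the end)
    data = list(dataset)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting keeps the three sets independent of any ordering in the original data.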
Agenda
1. Brain, Biological neuron, Artificial Neuron
2. Perceptron
3. Multilayer Perceptron & Backpropagation Algorithm
4. Application of neural networks
5. Practice using WEKA
6. Important & useful references
Important & Useful References for Neural Networks
- Neural Networks for Pattern Recognition, Christopher M. Bishop, Oxford University Press, 1995
- Neural Networks: A Comprehensive Foundation (2nd edition), Simon Haykin, Prentice Hall, 1998
- Pattern Classification, Richard O. Duda, Peter E. Hart, David G. Stork, John Wiley & Sons Inc, 2000
- Artificial Intelligence: A Modern Approach, Stuart J. Russell, Peter Norvig, Prentice Hall, 2002
- Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison Wesley, 2006
- Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), Ian H. Witten, Eibe Frank, Morgan Kaufmann, June 2005
- Neural Network FAQ: ftp://ftp.sas.com/pub/neural/FAQ.html
- Backpropagator's review: http://www.dontveter.com/bpr/bpr.html
- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/index.html
- WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
- Kangaroos and Training Neural Networks: http://www.sasenterpriseminer.com/documents/Kangaroos%20and%20Training%20Neural%20Networks.txt